Abstract

We consider the problem of communicating over a channel for which no mathematical model is specified, and the achievable
rates are determined as a function of the channel input and output sequences known a-posteriori, without assuming any a-priori
relation between them. In a previous paper we have shown that the empirical mutual information between the input and output
sequences is achievable without specifying the channel model, by using feedback and common randomness, and a similar result for
real-valued input and output alphabets. In this paper, we present a unifying framework which includes the two previous results as
particular cases. We characterize the region of rate functions which are achievable, and show that asymptotically the rate function
is equivalent to a conditional distribution of the channel input given the output. We present a scheme that achieves these rates
with asymptotically vanishing overheads.

I. Introduction

This paper revisits the "individual channel" communication model [1], which provides an alternative framework for
communication over unknown channels. The communication setup is illustrated in Figure 1. An encoder sends an input sequence
$\mathbf{x} \in \mathcal{X}^n$ into the channel. The output of the channel $\mathbf{y} \in \mathcal{Y}^n$ is determined in a
completely arbitrary way which is unknown to the encoder and the decoder. However, there is a perfect feedback link from the
decoder to the encoder, and we also assume the existence of common randomness. Under these assumptions we would like to
characterize a communication rate for the channel. Clearly, since nothing is guaranteed with respect to the output, one cannot
guarantee any positive communication rate a-priori while achieving a vanishing error probability. Therefore, instead, we define
the rate as a function of the specific input and output sequences ($R_{\mathrm{emp}}(\mathbf{x}, \mathbf{y})$, termed a rate function).
Fig. 1. The individual channel communication setup



The motivations for this communication model are elaborated upon in our initial paper [1], and will be briefly explained
here through an example. We consider the binary channel $y_i = x_i \oplus e_i$, where $e_i$ is an arbitrary sequence.
The traditional way to deal with this channel would be to use the arbitrarily varying channel (AVC) framework [2]. In this
framework feedback is not considered, and the AVC capacity is the maximum reliable communication rate that can be attained
irrespective of the choice, or distribution, of the state sequence (in this case $e_i$). However, in order to obtain a positive
capacity, it is necessary to place a constraint on $e_i$. Suppose that we limit the maximum rate of errors to
$\hat{\epsilon} \equiv \frac{1}{n}\sum_i e_i \le \epsilon_0 < \frac{1}{2}$; then by applying common randomness the AVC capacity
becomes $1 - h_b(\epsilon_0)$. This result requires placing an a-priori constraint. Furthermore, because of the worst-case nature
of the AVC capacity, the communication rate will not improve if $\hat{\epsilon} < \epsilon_0$, i.e. if the channel is actually
better than we have assumed. Shayevitz and Feder [3] proposed to deal with this issue by using feedback, and have presented a
scheme that, without assuming any prior constraint on $\hat{\epsilon}$, achieves the rate $1 - h_b(\hat{\epsilon})$.
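
To make the comparison concrete, the following minimal sketch evaluates the two rates of the binary example for a hypothetical noise sequence. The function names and the sample values (block length 100, a design constraint of 0.25, five actual errors) are illustrative assumptions and are not taken from the cited works.

```python
import math

def binary_entropy(p):
    """Binary entropy h_b(p) in bits, with h_b(0) = h_b(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def avc_rate(eps0):
    """Worst-case rate 1 - h_b(eps0) under the a-priori constraint eps0 < 1/2."""
    return 1.0 - binary_entropy(eps0)

def empirical_rate(noise):
    """Rate 1 - h_b(eps_hat), where eps_hat is the fraction of ones in the noise."""
    eps_hat = sum(noise) / len(noise)
    return 1.0 - binary_entropy(eps_hat)

# Hypothetical example: the design allowed up to 25% errors, but only 5% occurred.
noise = [1] * 5 + [0] * 95
print(avc_rate(0.25))         # rate fixed in advance by the worst-case design
print(empirical_rate(noise))  # rate matched to the noise that actually occurred
```

In this example the empirical rate exceeds the worst-case rate, which is the gain that the sequence-dependent approach is meant to capture.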

This result and its extensions [4] allow us to replace a-priori constraints by the empirical distribution of the noise (or
state) sequence that actually occurred, thus alleviating the worst-case assumptions. The result is that the rate is defined by the
sequence (i.e. the channel). Still, we need to assume a channel model relating the input and the output. Since channel models
are in many cases a coarse abstraction of reality, and in some cases may be completely unknown, the next step is to ask: can
we do without the model by, so to speak, "extracting" this model from the empirical data? In doing so, we define the empirical
rate function by using both the input and the output. This is a fundamental change with respect to the previous models, since
the input is determined by the scheme itself.

In the previous paper [1] we have shown that it is possible to attain the empirical mutual information
$R_{\mathrm{emp}}(\mathbf{x}, \mathbf{y}) = \hat{I}(\mathbf{x}; \mathbf{y})$, as well as the function
$R_{\mathrm{emp}}(\mathbf{x}, \mathbf{y}) = \frac{1}{2}\log\frac{1}{1-\hat{\rho}^2}$, where $\hat{\rho}$ is the empirical
correlation factor. The latter function is suitable for channels with real-valued inputs and outputs. These rate functions are
appealing since they are direct counterparts of statistical information measures. For the case of a discrete memoryless channel,
the empirical mutual information over the sequences tends in probability to the statistical mutual information over the input and
output random variables. The second function tends to the mutual information between two Gaussian random variables with the same
correlation factor, and thus is optimal for Gaussian channels. These results generalize achievability results for compound
channels and AVCs, and make it easy to re-derive the previously mentioned results [3], [4], and even extend them
[1, Section VII.B]. However, many questions are left open. For example, how can these functions be modified to include memory or
take into account MIMO channels, and what is the set of achievable rate functions? Is there a general way to extend the concept of
"empirical mutual information"? In addition, in the previous paper we have separated the discussion of the discrete and the
continuous cases, for technical reasons, and the natural question that comes to mind is whether the two results can be put into a
unified theory.
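
As an illustration of the second rate function, the sketch below evaluates $\frac{1}{2}\log\frac{1}{1-\hat{\rho}^2}$ for two real-valued sequences. It assumes that the empirical correlation factor is the normalized inner product of the two sequences without mean removal; this convention, the function name and the sample sequences are assumptions made for the example.

```python
import math

def correlation_rate(x, y):
    """Rate 0.5 * log2(1 / (1 - rho^2)) in bits per channel use.

    rho is taken here as <x, y> / (||x|| * ||y||), i.e. no mean is subtracted
    (an assumption of this sketch).
    """
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    rho_sq = sxy * sxy / (sxx * syy)
    if rho_sq >= 1.0:
        return math.inf  # perfectly correlated sequences
    return 0.5 * math.log2(1.0 / (1.0 - rho_sq))

# Hypothetical sequences: three of the four output samples agree with the input.
x = [1.0, -1.0, 1.0, -1.0]
y = [1.0, -1.0, 1.0, 1.0]
print(correlation_rate(x, y))
```

Orthogonal sequences give a rate of zero, and the rate grows without bound as the sequences become collinear.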

The main objective of this paper is to define such a unifying theory, by first characterizing the set of achievable rate functions,
presenting general communication schemes for achieving these rates with and without feedback (where only in the first case
the communication rate is adaptive), and presenting a tighter analysis of the overheads related to universally achieving these
rate functions. The new techniques used in this paper enable us to derive various rate functions and analyze the overhead (or
rate loss) required for attaining them in a finite block length. We present refined proof techniques that lead to tighter bounds,
and re-derive and improve over the previous results [1], [5], [6]. Note, however, that the different proof techniques used in the
previous work [1] are interesting on their own, and sometimes more intuitive. We will highlight the connections between the
results in the sequel.

II. Overview 

Following is a high level overview of the ideas and results presented in this paper. As mentioned above we would like to
refrain from stating the channel model. We define the rate of a system using a "rate function" $R_{\mathrm{emp}}(\mathbf{x}, \mathbf{y})$
of the input sequence $\mathbf{x}$ and output sequence $\mathbf{y}$. We would like to find systems which guarantee attaining certain
rate functions.

The first step is to define what "attaining" a rate function means. We refer to two kinds of systems: fixed rate systems without
feedback, and adaptive-rate systems using feedback. The adaptive rate systems guarantee that the transmitted rate would be
at least $R_{\mathrm{emp}}(\mathbf{x}, \mathbf{y})$ while keeping a small probability of error, for any input and output sequence,
i.e. this guarantee holds irrespective of the channel model. In the fixed rate case, since we cannot guarantee any positive rate
a-priori (the Shannon capacity of the channel in Figure 1 is 0), the system only guarantees reliable communication when
$R_{\mathrm{emp}} \ge R$ (the event $R_{\mathrm{emp}} < R$ can be considered as "outage"). Therefore the adaptive case is of more
interest from a practical perspective. We allow unlimited common randomness between the encoder and the decoder, and in order to
avoid circular definitions, we constrain the input distribution to a given prior $Q$. These definitions are stated formally and
discussed in Section III.

In classical communication and information theory, one only considers the average error probability over the channel law
and requires a certain static rate of communication, whereas here we require that the rate function would be specified per
input-output pair $\mathbf{x}, \mathbf{y}$, and that a certain error probability would be achieved. This may be seen as an
over-requirement; however, note that every system has, in effect, a rate function: one can always look at all the cases where the
input was a specific $\mathbf{x}$ and the output was a specific $\mathbf{y}$ and ask what was the actual rate of error-free bits
that was received in this case. Thus, we can consider the "rate function" as a way of characterizing communication systems which is
"channel independent". On the other hand, as we will see in Section IV-E, with a small overhead, the rate function of any system
can be attained with a fixed error probability.

The first question we ask is: which rate functions are achievable (Section IV)? Theorem 1 gives a necessary and a sufficient
condition for the achievability of a rate function (in the non-adaptive case), which are tight in the sense of the achieved rate
for large block size $n \to \infty$. In an analogy to universal source coding, this theorem is equivalent to the Kraft inequality,
stating which source encoders are feasible (in terms of the set of word lengths). Based on this result, we can characterize the
"intrinsic redundancy", which is a property of any rate function, determining the redundancy that would be needed to achieve
it (Theorem 2). Then, considering more general systems, it is shown that the good-put associated with a specific choice of
$\mathbf{x}, \mathbf{y}$ in any system is in fact an achievable rate function, and therefore can be achieved with an error
probability as low as desired, per sequence, up to a small overhead in rate.

The characterization of Theorem 1 is based on the CDF of the rate function with respect to the input distribution $Q$, which
is inconvenient to handle. In Section V we deal with the asymptotic behavior of rate functions, and show that asymptotically,
the achievability of rate functions can be determined based on a simpler condition similar to the Chernoff bound (Theorem 4).
The main result of this section is Theorem 5, which shows that the maximum rate functions are asymptotically of the form
$R_{\mathrm{emp}} = \frac{1}{n}\log\left(\frac{P(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})}\right)$ for some conditional probability
$P(\mathbf{x}|\mathbf{y})$. Thus, selecting rate functions is asymptotically equivalent to selecting conditional distributions
$P(\mathbf{x}|\mathbf{y})$. Returning again to the analogy to source coding, this claim is similar to the claim
that, due to the Kraft inequality, every source encoder is defined by a probability distribution on the set of possible messages [7].
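
One elementary property of rate functions of this form can be checked numerically: for any conditional distribution $P(\mathbf{x}|\mathbf{y})$ and any fixed $\mathbf{y}$, a Markov-type argument gives $Q\{\frac{1}{n}\log\frac{P(\mathbf{X}|\mathbf{y})}{Q(\mathbf{X})} \ge R\} \le \exp(-nR)$. The sketch below verifies this by enumeration for a small binary example. It is only an illustration of this inequality, not of the theorems themselves; the uniform prior, the randomly drawn conditional distribution and the function name are assumptions.

```python
import itertools
import math
import random

def worst_case_ratio(n=4, seed=0):
    """Return max over R of Q{R_emp >= R} * 2^(n R) for one fixed y.

    Q is uniform over {0,1}^n and P(x|y) is an arbitrary distribution drawn at
    random (both are assumptions of this sketch). The Markov-type bound says
    that the returned value never exceeds 1.
    """
    rng = random.Random(seed)
    xs = list(itertools.product([0, 1], repeat=n))
    q = 1.0 / len(xs)
    weights = [rng.random() + 0.01 for _ in xs]  # strictly positive weights
    total = sum(weights)
    p = [w / total for w in weights]
    rates = [math.log2(pi / q) / n for pi in p]
    worst = 0.0
    # The tail probability only changes at attained values of the rate
    # function, so it suffices to test thresholds at those values.
    for r in rates:
        tail = sum(q for v in rates if v >= r)
        worst = max(worst, tail * 2.0 ** (n * r))
    return worst

print(worst_case_ratio())
```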



The set of achievable rate functions is rather arbitrary (like the set of possible encoders, in the analogy). In Section VI
we discuss the problem of selecting the rate function, using several possible constructions. Each construction has a certain
justification and results in a certain form. The first construction, which we term the "maximum likelihood construction"
(Section VI-B), is based on taking the maximum of the form $\frac{1}{n}\log\frac{P_\theta(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})}$
over a class of models $\theta$. Achieving this rate function guarantees matching (or surpassing) the rate of any system operating
over any of the channels in the model class. Another way to remove the arbitrariness (Section VI-E) is to limit the scope to rate
functions defined based on a predefined set of parameters (for example the empirical second order moments, or zero order joint
statistics). When the parameters can take only a sub-exponential number of values, the input and output sequences can be grouped
into "types" of sequences having the same values of the empirical parameters. Theorem 6 determines the optimal rate function that
can be obtained in this case. We particularize the result to the memoryless case, and present the best rate function that can be
defined by zero order statistics (Lemma 5). This rate function can also be stated in terms of the "maximum likelihood"
construction, and on the other hand is close to the empirical mutual information, which means that the empirical mutual
information is essentially optimal (in terms of using the zero order statistics). A third way to define a rate function
(Section VI-F) is by taking another system as a reference and asking what is the maximum rate that can be achieved with a given
decoding metric and a given prior, when the number of messages is allowed to vary, i.e. conditioned on a certain pair of input and
output, how many messages can one send while still maintaining a small probability of error? In the rest of the paper we focus
mainly on the "maximum likelihood" construction.
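
As a small illustration of the "maximum likelihood" form, the sketch below evaluates $\max_\theta \frac{1}{n}\log\frac{P_\theta(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})}$ for a hypothetical model class: binary sequences, a uniform prior $Q$, and $P_\theta(\mathbf{x}|\mathbf{y}) = \theta^d(1-\theta)^{n-d}$ with $d$ the Hamming distance between $\mathbf{x}$ and $\mathbf{y}$. The model class, the grid over $\theta$ and the names are assumptions made for the example. The code only evaluates the expression and says nothing about the redundancy needed to achieve it.

```python
import math

def ml_rate_bsc(x, y, thetas):
    """Maximum over theta of (1/n) log2(P_theta(x|y) / Q(x)).

    Assumes Q uniform over {0,1}^n, so -log2 Q(x) = n, and
    P_theta(x|y) = theta^d * (1 - theta)^(n - d), with thetas inside (0, 1).
    """
    n = len(x)
    d = sum(a != b for a, b in zip(x, y))
    best = -math.inf
    for th in thetas:
        log_p = d * math.log2(th) + (n - d) * math.log2(1.0 - th)
        best = max(best, (log_p + n) / n)
    return best

# Hypothetical sequences of length 20 that differ in 2 positions.
grid = [k / 100 for k in range(1, 100)]
x = [0] * 20
y = [1, 1] + [0] * 18
print(ml_rate_bsc(x, y, grid))
```

For this class the maximizing $\theta$ is the empirical error rate $d/n$, so the expression evaluates to $1 - h_b(d/n)$, the same form as the modulo-additive rate discussed in the Introduction.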

The main strength of the "individual channel" approach is when the rate function can be obtained adaptively, without outage. 



Section VII focuses on rate adaptivity. In Section VII-A we present a communication scheme that attains an adaptive rate using
multiple iterations of rateless coding. Theorem 7 and its corollaries characterize the performance of the proposed rate adaptive
scheme. The scheme is based on a decoding metric that must satisfy some conditions and needs to be specified later, and
the rate function is given as a function of this metric. In what follows we substitute various metrics to obtain various rate
functions. In Section VII-E we show that under a "causality" condition, the rate function
$R_{\mathrm{emp}} = \frac{1}{n}\log\left(\frac{P(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})}\right)$ (which is the asymptotical bound for
all rate functions) can be adaptively achieved (Theorem 8).

Next we focus on "maximum likelihood" rate functions (Section VII-F). In Theorem 9 we show the achievability of such
rate functions when the "maximum likelihood" probability $\max_\theta P_\theta(\mathbf{x}|\mathbf{y})$ can be given as a weighted
sum of $P_\theta(\mathbf{x}|\mathbf{y})$ (which always holds when the number of $\theta$-s is subexponential in $n$). We
particularize this result for rate functions based on empirical probabilities (Theorem 10) and present bounds on the redundancy for
the adaptive and non-adaptive case. In the more general case where $\theta$ belongs to an infinite class, we do not have a general
result on adaptivity; however, we show that some properties required for the application of Theorem 7 hold in general for the
"maximum likelihood" construction (this is stated as a lemma later in the paper).

The rate adaptive scheme presented in Section VII-A is finite horizon, i.e. it requires prior knowledge of the block length $n$. In
Section VII-G we present an infinite horizon extension of the scheme, based on a simple "doubling trick". The modified scheme
attains the results of Theorem 7 under some assumptions. Unfortunately, the results regarding rate adaptivity in Section VII
are not as tight and elegant as the results in the non-adaptive case - this manifests itself in the relatively high redundancy of 

the scheme (which generally behaves like O I \/ -^^ I in the block length), as well as its complexity, and the fact we do not 

have a tight lower bound (necessary condition) on the redundancy. 



In Section VIII we present examples of rate functions, which include as particular cases the previous results [1]. The rate
functions include the empirical mutual information (Section VIII-A), an extension that uses memory in the channel (which is
optimal for stationary ergodic channels, Section VIII-B), a discussion on extensions that include time variation (Section VIII-C),
the modulo-additive rate function presented by Shayevitz and Feder [3] (Section VIII-D), rate functions based on compression
(Section VIII-E), and a second-order rate function for the MIMO channel (Section VIII-F, Theorem 13 and Lemma 10).



Section IX is devoted to comments and further research. In Section IX-A we compare with the results of the previous paper [1].

Before beginning the formal parts, several comments are due on the general approach taken in this paper. First, this work
is theoretical in nature. No effort is made to improve the decoder complexity, or reduce the amount of common randomness
required. The reason behind this is that we are mainly interested in examining this communication concept. If the concept
proves fruitful, the next step should be trying to make the implementation practical. Also, while we do not attempt to be
practical regarding the implementation, the requirements from the system do need to be related to practical targets. The second
comment is that in this work we focus on transmission rate rather than on error exponents. The theoretical reason is that the
discussion around error exponents is based on the fact that the error probability with a fixed rate and a known, stationary ergodic
channel decreases exponentially. Here, the rate is not fixed, and the channel is not specified, so this does not necessarily hold
true. The second reason is practical: from a practical perspective of requirements, there is no reason to require the system's
error probability to decrease exponentially fast (if anything, the block error rate should be allowed to increase with $n$). Rather,
it makes sense to require a small, but fixed, error probability.

III. Definitions 
The definitions in this section are almost identical to the ones stated in the previous paper [1], and are repeated here for
completeness. The main difference is the absence of the set J. We define the channel, adaptive and non-adaptive systems
and achievability in the adaptive and non-adaptive sense. If the motivation for these definitions is not immediately clear, the
asymptotically achievable rate functions $\hat{I}(\mathbf{x}; \mathbf{y})$ and $\frac{1}{2}\log\frac{1}{1-\hat{\rho}^2}$ can be
regarded as motivating examples.



A. Notation 

Uppercase letters denote random variables, and respective lowercase letters denote their sample values. Boldface letters
are used to denote vectors, which are by default of length $n$. Superscript and subscript indices are applied to vectors to
define subsequences in the standard way, i.e. $\mathbf{x}_i^j = (x_i, x_{i+1}, \ldots, x_j)$, $\mathbf{x}^i = \mathbf{x}_1^i$. The
indices $i, j$ are allowed to exceed the range of indices where $\mathbf{x}$ is defined (for example be negative), in which case
only the indices in the definition range will be considered (e.g. $\mathbf{x}_{-1}^{n+1} = \mathbf{x}^n$,
$\mathbf{x}^{-1} = \emptyset$). The indicator function $\mathrm{Ind}(E)$, where $E$ is a set or a probabilistic event, is defined as
1 over the set (or when the event occurs) and 0 otherwise. $P \circ Q$ denotes the product of conditional probability functions,
e.g. $(P \circ Q)(x, y) = P(x) \cdot Q(y|x)$. $\mathbb{U}(A)$ denotes a uniform distribution over the set $A$.

$\mathbb{R}$ denotes the set of real numbers, and $\mathcal{N}(\mu, \sigma^2)$ denotes a Gaussian distribution with mean $\mu$ and
variance $\sigma^2$. $\|\mathbf{x}\| \equiv \sqrt{\mathbf{x}^T\mathbf{x}}$ denotes the $L_2$ norm. $\mathrm{Ber}(p)$ denotes the
Bernoulli distribution, and $h_b(p) \equiv H(\mathrm{Ber}(p)) = -p\log p - (1-p)\log(1-p)$ denotes the binary entropy function.

A hat ($\hat{\square}$) denotes an estimated value. The empirical mutual information of two vectors
$\hat{I}(\mathbf{x}; \mathbf{y})$ is the mutual information between two random variables $X, Y$ whose joint distribution equals
the empirical distribution of $\mathbf{x}, \mathbf{y}$ [8, Section II]. An exact definition of empirical mutual information and
other empirical information measures is delayed to Sections VI-A4 and VI-A5.
We denote by $I(P, W)$ the mutual information $I(X; Y)$ when $(X, Y) \sim P(x) \cdot W(y|x)$.
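
The verbal definition of the empirical mutual information given above can be sketched as follows: the joint empirical distribution is formed by counting symbol pairs, and the mutual information of that distribution is computed in bits. The function name and the sample sequences are illustrative assumptions.

```python
import math
from collections import Counter

def empirical_mutual_information(x, y):
    """Mutual information (in bits) of the joint empirical distribution of x and y."""
    n = len(x)
    joint = Counter(zip(x, y))  # counts of symbol pairs (a, b)
    count_x = Counter(x)        # counts of input symbols
    count_y = Counter(y)        # counts of output symbols
    # p(a,b) * log2( p(a,b) / (p(a) p(b)) ), written in terms of counts.
    return sum((c / n) * math.log2(c * n / (count_x[a] * count_y[b]))
               for (a, b), c in joint.items())

print(empirical_mutual_information([0, 1, 0, 1], [0, 1, 0, 1]))  # identical sequences
print(empirical_mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))  # empirically independent
```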

The functions $\log(\cdot)$ and $\exp(\cdot)$ as well as information theoretic quantities $H(\cdot)$, $I(\cdot;\cdot)$,
$D(\cdot\|\cdot)$ are in base 2 (bits) (and can be interpreted as other information units by changing the base of the log). We use
$\ln(\cdot)$ to denote the natural logarithm.

Bachmann-Landau notations are used for orders of magnitude. Specifically, $f_n = \Theta(g_n)$ means
$\exists n_0, \alpha, \beta > 0 : \forall n > n_0 : \alpha g_n < f_n < \beta g_n$; $f_n \in o(g_n)$ or $f_n = o(g_n)$ means
$\frac{f_n}{g_n} \to 0$; and $f_n \in \omega(g_n)$ means $\frac{f_n}{g_n} \to \infty$.

Most of the results apply both to the case where the input is discrete, and characterized by a probability mass function,
and to the case where it is continuous and characterized by a density function. When denoting $p(x)$ as the probability of $x$
without specifying whether $x$ is continuous or discrete, it means that $p(x)$ may be substituted by either the probability mass
function or a density function, as applicable.

Note that proofs are sometimes given after the Theorem/Lemma is stated, and sometimes before it, as seems easier to read.
In the latter case the Theorem/Lemma summarizes a conclusion from a discussion.

B. Individual channels and rate functions 

Definition 1 (Channel). A channel is defined by a pair of input and output alphabets $\mathcal{X}, \mathcal{Y}$, and is denoted
$\mathcal{X} \to \mathcal{Y}$.

Definition 2 (Rate function). A rate function $R_{\mathrm{emp}} : \mathcal{X}^n \times \mathcal{Y}^n \to \mathbb{R}$ for the channel
$\mathcal{X} \to \mathcal{Y}$ may be any real valued function of $\mathbf{x} \in \mathcal{X}^n, \mathbf{y} \in \mathcal{Y}^n$.

Note that we do not preclude negative values, for reasons of notational convenience. Also, we have defined the set of possible
outputs as $n$ length vectors $\mathcal{Y}^n$ mainly for the sake of concreteness; many of the results in the paper do not assume
anything about the structure of $\mathbf{y}$, and thus in general, the output does not have to be a vector of the same length as
the input.

C. Fixed rate communication without feedback 

Definition 3 (Fixed rate encoder, decoder, error probability). A randomized block encoder and decoder pair for the channel
$\mathcal{X} \to \mathcal{Y}$ with block length $n$ and rate $R$ without feedback is defined by a random variable $S$ distributed
over the set $\mathcal{S}$, a mapping $\mathbf{X} : \{1, 2, \ldots, \exp(nR)\} \times \mathcal{S} \to \mathcal{X}^n$ and a mapping
$\hat{m} : \mathcal{Y}^n \times \mathcal{S} \to \{1, 2, \ldots, \exp(nR)\}$. The error probability for
message $m \in \{1, 2, \ldots, \exp(nR)\}$ is defined as

$$P_e^{(m)}(\mathbf{x}, \mathbf{y}) = \Pr\left(\hat{m}(\mathbf{y}, S) \ne m \,\middle|\, \mathbf{X}(m, S) = \mathbf{x}\right) \tag{1}$$

where for $\mathbf{x}$ such that the conditioning in (1) cannot hold, we define $P_e^{(m)}(\mathbf{x}, \mathbf{y}) = 0$.

This system is illustrated in Figure 2. We treat $\mathbf{x}$ as a random variable and $\mathbf{y}$ as a deterministic sequence.
This does not preclude applying the results to a channel whose output $\mathbf{y}$ is a random variable and depends on
$\mathbf{x}$, since all results are conditioned on both $\mathbf{x}$ and $\mathbf{y}$. Note that the encoder rate must pertain to a
discrete number of messages $\exp(nR) \in \mathbb{Z}_+$, but the empirical rates we refer to in the sequel may be any positive real
numbers. In the sequel, $m$ is treated sometimes as a series of bits and sometimes as an index of the message.

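Definition 4 (Achievability). A rate function $R_{\mathrm{emp}} : \mathcal{X}^n \times \mathcal{Y}^n \to \mathbb{R}$ is achievable
with a prior $Q(\mathbf{x})$ defined over $\mathcal{X}^n$ and error probability $\epsilon$ if for any $R > 0$, there exists a pair
of randomized encoder and decoder, with a rate of at least $R$, such that for any message $m$: $\mathbf{X} \sim Q$, and for any
$\mathbf{x}, \mathbf{y}$ where $R_{\mathrm{emp}}(\mathbf{x}, \mathbf{y}) \ge R$, $P_e^{(m)}(\mathbf{x}, \mathbf{y}) \le \epsilon$.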

We sometimes term this kind of achievability "non-adaptive achievability" to separate it from the adaptive achievability
defined below. The usage of the notation $R_{\mathrm{emp}}$ does not immediately imply the rate function is achievable (or
adaptively, or asymptotically achievable, by the definitions below). We sometimes place a superscript asterisk,
$R_{\mathrm{emp}}^*$, to specify that the given function is indeed achievable. Note that the definition requires that the conditions
hold for all $R > 0$; however, this is done mainly for convenience, and if we are interested in the achievability of
$R_{\mathrm{emp}}$ at a specific $R$ we can always define a new rate function $R'_{\mathrm{emp}}$, obtained from $R_{\mathrm{emp}}$
by treating the case $R_{\mathrm{emp}} < R$ separately, whose achievability indicates that the achievability conditions are met for
the specific $R$.



D. Adaptive rate communication with feedback 

Definition 5 (Adaptive rate encoder, decoder, error probability). A randomized block encoder and decoder pair for the channel
$\mathcal{X} \to \mathcal{Y}$ with block length $n$, adaptive rate and feedback is defined as follows:

• The message $\mathbf{m}$ is expressed by the infinite sequence $\mathbf{m}_1^\infty \in \{0, 1\}^\infty$

• The common randomness is defined as a random variable $S$ distributed over the set $\mathcal{S}$

• The feedback alphabet is denoted $\mathcal{F}$

• The encoder is defined by a series of mappings $X_k = \phi_k(\mathbf{m}, S, \mathbf{f}^{k-1})$

• The decoder is defined by the feedback function $f_k = \varphi_k(\mathbf{y}^k, S)$, the decoding function
$\hat{\mathbf{m}}(\mathbf{y}, S)$ and the rate function $R(\mathbf{y}, S)$.

The random variables $\mathbf{X}$, $\hat{\mathbf{m}}$ and $R$ denote the outcomes of the respective functions. The error
probability for message $\mathbf{m}$ is defined as

$$P_e^{(\mathbf{m})}(\mathbf{x}, \mathbf{y}) = \Pr\left(\hat{\mathbf{m}}_1^{\lceil nR \rceil} \ne \mathbf{m}_1^{\lceil nR \rceil} \,\middle|\, \mathbf{X} = \mathbf{x}, \mathbf{y}\right) \tag{2}$$

In other words, a recovery of the first $\lceil nR \rceil$ bits by the decoder is considered a successful reception. For
$\mathbf{x}$ such that the conditioning in (2) cannot hold, we define $P_e^{(\mathbf{m})}(\mathbf{x}, \mathbf{y}) = 0$. The
conditioning on $\mathbf{y}$ is mainly for clarification, since it is treated as a fixed vector. This system is illustrated in
Figure 3.

In all cases discussed in this paper the feedback is binary, $\mathcal{F} = \{0, 1\}$. Furthermore, we sometimes consider reducing
the feedback rate below 1 bit/use. In this case some of the feedback values $f_k$ will be fixed to 0, and the feedback rate is the
ratio of unconstrained feedback bits.

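Definition 6 (Adaptive achievability). A rate function $R_{\mathrm{emp}} : \mathcal{X}^n \times \mathcal{Y}^n \to \mathbb{R}$ is
adaptively achievable with a prior $Q(\mathbf{x})$ defined over $\mathcal{X}^n$ and error probability $\epsilon$, if there exist a
randomized encoder and decoder with feedback, such that $\mathbf{X} \sim Q$ and
for all $\mathbf{x} \in \mathcal{X}^n, \mathbf{y} \in \mathcal{Y}^n$:

$$\Pr\left\{ \left(\hat{\mathbf{m}}_1^{\lceil nR \rceil} \ne \mathbf{m}_1^{\lceil nR \rceil}\right) \cup \left(R < R_{\mathrm{emp}}(\mathbf{x}, \mathbf{y})\right) \,\middle|\, \mathbf{X} = \mathbf{x}, \mathbf{y} \right\} \le \epsilon \tag{3}$$

In other words, with probability at least $1 - \epsilon$, a message with a rate of at least $R_{\mathrm{emp}}$ is decoded correctly.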



The model in which the decoder determines the transmission rate is lenient in the sense that it gives the flexibility to 
exchange rate for error probability; the decoder may estimate the error probability and decrease it by reducing the decoding 
rate. 



E. Approximate achievability 

Definition 7 (Achievability up to a gap). We say that $R_{\mathrm{emp}}(\mathbf{x}, \mathbf{y})$ is achievable (adaptively/non
adaptively) up to $\mu$ (or with a gap of $\mu$) with a certain $Q$ and $\epsilon$, if
$R_{\mathrm{emp}}'(\mathbf{x}, \mathbf{y}) = R_{\mathrm{emp}}(\mathbf{x}, \mathbf{y}) - \mu$ is achievable (adaptively/non
adaptively, resp.).

Note that $\mu$ can be translated to a loss in rate. This is clear in the adaptive case where the rate is a function of
$R_{\mathrm{emp}}$. In the non adaptive case the definition above means there is a system that transmits at rate $R - \mu$ and
achieves an error probability of at most $\epsilon$ whenever $R_{\mathrm{emp}} \ge R$ (which is equivalent to
$R_{\mathrm{emp}} - \mu \ge R - \mu$).

Definition 8 (Asymptotic achievability). A sequence of rate functions defined for $n = 1, 2, \ldots$ is asymptotically achievable
(adaptively / non adaptively) with a prior $Q(\mathbf{x})$ defined for vectors $\mathbf{x} \in \mathcal{X}^n$ of increasing size, if
for all $\epsilon > 0$ there exists a sequence of functions $F_n(t)$, $n = 1, 2, \ldots$ with
$F_n(t) \xrightarrow[n \to \infty]{} t$, such that $F_n(R_{\mathrm{emp}}(\mathbf{x}, \mathbf{y}))$ is achievable (adaptively / non
adaptively, resp.) with the given $\epsilon$ and $Q(\mathbf{x})$.

Note that relating the rate function to the achievable function through $F_n(t) \to t$ is in general weaker than requiring
that their ratio would tend to 1, since $F_n(t) \to t$ does not necessarily converge uniformly. As an example, consider
$F_n(t) = \min(t, n)$, and the two (equal) functions $f_n = g_n = 2n$; then although $\frac{f_n}{g_n} \to 1$,
$\frac{F_n(f_n)}{g_n} \to \frac{1}{2}$. The reason to use this definition is that indeed in many cases of interest, the convergence
of the rate function is non uniform. However, the results are useful since $t$ has a meaning of rate, and the slow convergence
occurs only at high rates.
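
The example above can be checked numerically with a few lines; the function name `F` and the chosen block lengths are only illustrative.

```python
def F(n, t):
    """The example sequence of functions F_n(t) = min(t, n)."""
    return min(t, n)

for n in (10, 100, 1000):
    f = g = 2 * n
    # Ratio of the two sequences, and ratio after passing f through F_n.
    print(n, f / g, F(n, f) / g)

# Pointwise convergence: for any fixed t, F_n(t) equals t once n >= t.
print(F(1000, 37.5))
```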

F. Discussion 

Note that achievability is defined with respect to a fixed prior Q(x). Although the rate function depends on specific sequences, 
for actual communication to happen it is necessary to select input sequences, and Q(x) defines the main property of this selection 
needed for our purpose, i.e. the input distribution. 

The reason for fixing $Q$ is that the achievable rates are a function of the channel input, which is determined by the scheme
itself. This leaves room for a void solution: the encoder may choose sequences for which the rate is attained more easily.
For example, by setting $\mathbf{x} = \mathbf{0}$ one can attain $R_{\mathrm{emp}} = \hat{I}(\mathbf{x}; \mathbf{y})$ in a void way,
since the rate function will always be 0. We circumvent this difficulty by constraining the input distribution and, by using common
randomness, requiring that the encoder emit input symbols that are random and distributed according to the defined prior. This
breaks the circular dependence that might have been created, by specifying the input behavior together with the rate function.

In a high level view we can say that the individual channel framework does not contain any tools to modify the input
behavior, since nothing is assumed on the effect of a change in the input, and therefore the input prior is constrained. For
this reason, in the current framework we only gain rate adaptivity from feedback, but do not improve the communication
rate. In channels with memory, it is possible to improve the channel capacity using feedback, but this improvement is due to
modification of the input distribution (conditioned on the output). This gain cannot be obtained in the current framework due
to the constraint on the input distribution.

Note that these results hold under the theoretical assumption that one may have access to a random variable of any desired
distribution, which is in some cases infeasible to generate in an exact manner; see further discussion in our previous paper [1].

IV. Fundamental limitations on rate functions 

The selection of rate functions is rather arbitrary. This can be seen by the following example: suppose
$R_{\mathrm{emp}}(\mathbf{x}, \mathbf{y})$ is achievable, and let $\pi : \mathcal{Y}^n \to \mathcal{Y}^n$ be a permutation of the
output values; then clearly $R_{\mathrm{emp}}(\mathbf{x}, \pi(\mathbf{y}))$ is also achievable, by placing the permutation $\pi$
before the decoder (so that the effective channel output seen by the system is $\mathbf{y}' = \pi(\mathbf{y})$). In general
none of the rate functions generated by various values of $\pi$ is uniformly better than the others. In the sequel we will discuss
possible reasonable ways to choose rate functions, that may eliminate some of these choices. However, we start with the more
basic question: what is the set of achievable rate functions?

In this section and the following ones we focus only on the non-adaptive case, and characterize the set of achievable
rate functions. The role of this bound is similar to the role of Kraft's inequality in source encoding: it does not indicate
a preference for specific encoders, but merely states which encoding lengths are possible (can be implemented by uniquely
decodable encoders) and which are not. The rate function $R_{\mathrm{emp}}$ takes the role of encoding lengths in Kraft's inequality.

A. A characterization of the set of achievable rate functions 

The following theorem presents a necessary condition and a sufficient condition for a rate function $R_{\mathrm{emp}}$ to be
achievable, in the fixed rate sense.



Fig. 4. Achievable and unachievable regions in Theorem 1 (vertical axis: $Q\{R_{\mathrm{emp}} \ge R\}$)



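Theorem 1. Consider communication over block length $n$, with a prior $Q$ and error probability $\epsilon$. If
$R_{\mathrm{emp}}(\mathbf{x}, \mathbf{y})$ is achievable, or adaptively achievable, then:

$$\forall \mathbf{y} \in \mathcal{Y}^n, R \in \mathbb{R} : \; Q\{R_{\mathrm{emp}}(\mathbf{X}, \mathbf{y}) \ge R\} \le \frac{1}{1-\epsilon}\exp(-nR) \tag{4}$$

Conversely, if

$$\forall \mathbf{y} \in \mathcal{Y}^n, R > 0 : \; Q\{R_{\mathrm{emp}}(\mathbf{X}, \mathbf{y}) \ge R\} \le \epsilon\exp(-nR) \tag{5}$$

then $R_{\mathrm{emp}}(\mathbf{x}, \mathbf{y})$ is achievable.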

Here $Q\{R_{\mathrm{emp}}(\mathbf{X}, \mathbf{y}) \ge R\}$ means the probability, with respect to $\mathbf{X}$ distributed $Q$, of
the event $R_{\mathrm{emp}}(\mathbf{X}, \mathbf{y}) \ge R$. The necessary condition refers to both achievable and adaptively
achievable rate functions, whereas the sufficient condition only refers to achievable rate functions (adaptive achievability is
discussed in Section VII). Note that the necessary condition holds trivially for $R \le 0$ (the definition is extended to negative
$R$-s for matters of convenience, which will become clear later on). These conditions are depicted graphically in Figure 4, where
the horizontal axis is the rate and the vertical axis is the probability $Q\{R_{\mathrm{emp}} \ge R\}$.

Both bounds characterize the achievability of $R_{\mathrm{emp}}$ based on the probability of $R_{\mathrm{emp}}$ to exceed a
threshold for a fixed value of $\mathbf{y}$ (its CCDF). The rationale behind this characterization is as follows. Consider the
system of Definition 3 and fix the output $\mathbf{y}$. Clearly, no information can be transmitted in this case. At each block,
there is a codebook of input sequences $\mathbf{X}_i$, $i = 1, 2, \ldots, \exp(nR)$, where $\mathbf{X}_i$ would be transmitted if
the input message is $m = i$. The decoder does not know which of these words was chosen but only knows the codebook. However, it
guarantees that with high probability it will decode the correct word, if this word has
$R_{\mathrm{emp}}(\mathbf{X}_i, \mathbf{y}) \ge R$. This is possible only if in most codebooks, only one word satisfies the
condition. This leads to the bound on the probability of $R_{\mathrm{emp}}(\mathbf{X}_i, \mathbf{y}) \ge R$.

Note that if a rate function satisfies the sufficient condition with strict inequality (for all or some values of $\mathbf{y}$ and $R$), then it can be modified to a larger function meeting the condition with equality, by using the inverse transform theorem, i.e. by passing the random variable $R_{emp}(\mathbf{X},\mathbf{y})$ through its CDF to obtain a uniform random variable and then through the inverse of the desired CDF satisfying (5) with equality. A remarkable property of the necessary and sufficient conditions is that, since they are given per value of $\mathbf{y}$, there is no tradeoff between different values of $\mathbf{y}$ (i.e. one can decide on a rate function separately for each $\mathbf{y}$). Indeed, these are only bounds, and in an accurate characterization of the domain of achievable rate functions there is a tradeoff between different values of $\mathbf{y}$. But later on we shall see that this property of separation between values of $\mathbf{y}$ holds also in the asymptotic form of the bound (Theorem 5).

Following Theorem 1 it is convenient to make the following definition: define the intrinsic redundancy of a rate function $R_{emp}$ with respect to a prior $Q$ as:

$$\mu_Q(R_{emp}) = \sup_{\mathbf{y},\, R \in \mathbb{R}} \left\{ \frac{1}{n}\log Q\{R_{emp}(\mathbf{X},\mathbf{y}) \ge R\} + R \right\} \tag{6}$$



This definition simply extracts the normalized coefficient before the $\exp(-nR)$ in Theorem 1, i.e. it is the minimum value $\mu_Q$ such that:

$$\forall \mathbf{y}, R: \quad Q\{R_{emp}(\mathbf{X},\mathbf{y}) \ge R\} \le \exp(n \cdot \mu_Q) \cdot \exp(-nR) \tag{7}$$

Theorem 1 can now be stated as follows:

1) A rate function $R_{emp}$ is achievable if $\mu_Q(R_{emp}) \le \frac{1}{n}\log\epsilon$

2) A rate function $R_{emp}$ is achievable only if $\mu_Q(R_{emp}) \le \frac{1}{n}\log\frac{1}{1-\epsilon}$

It is easy to see that the inequalities above together with the definition of $\mu_Q$ directly imply the inequalities in Theorem 1. Note that the two bounds on $\mu_Q(R_{emp})$ converge to 0 for fixed $\epsilon$ as $n \to \infty$.
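As an illustration of definition (6) and of the restated conditions, the following minimal sketch computes $\mu_Q$ by brute force for a finite alphabet. The function names, the dictionary layout (`rate[y][x]`, `prior[x]`) and the use of natural logarithms (rates in nats) are assumptions made for this example only. It relies on the observation that, for a fixed $\mathbf{y}$, the expression in (6) grows linearly in $R$ between jumps of the CCDF, so the supremum is attained at one of the values taken by $R_{emp}$. The toy rate function is the indicator $\mathrm{Ind}(\mathbf{x} = \mathbf{y})$ (one bit, i.e. $\log 2$ nats) with a uniform prior.

```python
import math
from itertools import product

def intrinsic_redundancy(rate, prior, n):
    """Brute-force evaluation of (6): sup over y, R of (1/n) log Q{R_emp >= R} + R.

    rate[y][x] holds R_emp(x, y) in nats, prior[x] holds Q(x).
    Only values of R_emp attained with positive prior need to be checked.
    """
    best = -math.inf
    for row in rate.values():
        for x0, r in row.items():
            if prior[x0] == 0:
                continue
            ccdf = sum(q for x, q in prior.items() if row[x] >= r)
            best = max(best, math.log(ccdf) / n + r)
    return best

def achievability_status(mu, n, eps):
    """Apply the two restated conditions; between them nothing is decided."""
    if mu <= math.log(eps) / n:
        return "achievable"
    if mu > math.log(1.0 / (1.0 - eps)) / n:
        return "not achievable"
    return "undetermined"

# Toy example (hypothetical parameters): n = 3 binary symbols, uniform prior.
n = 3
seqs = list(product((0, 1), repeat=n))
prior = {x: 2.0 ** -n for x in seqs}
wire = {y: {x: math.log(2) if x == y else 0.0 for x in seqs} for y in seqs}
mu_wire = intrinsic_redundancy(wire, prior, n)
```

For this toy function the intrinsic redundancy is zero, so with $\epsilon = 0.1$ it falls in the gap between the two conditions.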

Intuitively the intrinsic redundancy characterizes an overhead that exists in $R_{emp}$ and will be expressed in a loss when trying to achieve this rate function. The more "ambitious" the rate function, the larger the redundancy. We note the following two properties of $\mu_Q$:

1) When an offset $\delta \in \mathbb{R}$ is added to (or subtracted from) the rate function:

$$\mu_Q(R_{emp} + \delta) = \mu_Q(R_{emp}) + \delta \tag{8}$$

2) When taking the maximum of several rate functions $R_{emp}(\mathbf{x},\mathbf{y}) = \max_{k \in \{1,\ldots,K\}} R_{emp,k}(\mathbf{x},\mathbf{y})$, we have:

$$\mu_Q\left(\max_k R_{emp,k}\right) \le \max_k \mu_Q(R_{emp,k}) + \frac{\log K}{n} \tag{9}$$

The term $\frac{\log K}{n}$ can be regarded as the price paid for "universality", in the sense of exceeding several rate functions. The proof of these properties is straightforward and is deferred to Section A.
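The two properties can be checked numerically on small random examples. The sketch below is only a sanity check under assumed conventions (list layout `rate[y][x]`, strictly positive prior, natural logarithms, hypothetical sizes), not a proof.

```python
import math
import random

def intrinsic_redundancy(rate, prior, n):
    """Evaluate (6) for rate[y][x] = R_emp(x, y) and prior[x] = Q(x) > 0 (nats)."""
    return max(
        math.log(sum(q for q, r2 in zip(prior, row) if r2 >= r)) / n + r
        for row in rate
        for r in row
    )

rng = random.Random(0)                     # fixed seed for reproducibility
n, num_x, num_y, K = 4, 16, 5, 3           # hypothetical toy sizes
prior = [1.0 / num_x] * num_x
funcs = [[[rng.uniform(0.0, 1.0) for _ in range(num_x)] for _ in range(num_y)]
         for _ in range(K)]

# Property (8): adding an offset delta shifts the intrinsic redundancy by delta.
delta = 0.25
shifted = [[r + delta for r in row] for row in funcs[0]]
mu_each = [intrinsic_redundancy(f, prior, n) for f in funcs]
mu_shift = intrinsic_redundancy(shifted, prior, n)

# Property (9): the pointwise maximum of K functions costs at most log(K)/n.
pointwise_max = [[max(f[y][x] for f in funcs) for x in range(num_x)]
                 for y in range(num_y)]
mu_max = intrinsic_redundancy(pointwise_max, prior, n)
price = math.log(K) / n
```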

Suppose that a rate function $R_{emp}$ has a given intrinsic redundancy $\mu_Q(R_{emp})$; we may reduce it by an offset $\delta$ to make this rate function achievable. Denote $R_{emp}^* = R_{emp} - \delta$; then $R_{emp}^*$ will be achievable if $\mu_Q(R_{emp}^*) = \mu_Q(R_{emp}) - \delta \le \frac{1}{n}\log\epsilon$, i.e. if $\delta \ge \mu_Q(R_{emp}) + \frac{1}{n}\log\frac{1}{\epsilon}$. Conversely, it will not be achievable if $\mu_Q(R_{emp}^*) = \mu_Q(R_{emp}) - \delta > \frac{1}{n}\log\frac{1}{1-\epsilon}$, i.e. if $\delta < \mu_Q(R_{emp}) - \frac{1}{n}\log\frac{1}{1-\epsilon}$. Using this argument, we can characterize the achievability of rate functions by specifying what value of $\delta$ (overhead) makes them achievable. This is formalized in the following theorem:

Theorem 2. For a rate function $R_{emp}$ to be achievable up to $\delta$, with prior $Q$ and error probability $\epsilon$, it is necessary that $\delta \ge \mu_Q(R_{emp}) - \frac{1}{n}\log\frac{1}{1-\epsilon}$ and sufficient that $\delta \ge \mu_Q(R_{emp}) + \frac{1}{n}\log\frac{1}{\epsilon}$.

This theorem gives a meaning to the term "intrinsic redundancy" and we can see how it affects the actual redundancy. The actual redundancy is comprised of a term depending on the intrinsic redundancy and a term depending on the desired error probability. The proof is given by the discussion above. Using this theorem we can see more clearly the rate penalty for decreasing the error probability. Supposing that we know a rate function $R_{emp}$ is achievable with an error probability $\epsilon_1$, we may use the theorem to bound the redundancy required to achieve it with an error probability $\epsilon_2$. Furthermore, (9) implies that competing against $K$ competitors who attain the rate functions $R_{emp,k}$ incurs a small asymptotic price.

Up to the gap between the necessary and sufficient conditions in Theorems 1 and 2, these conditions are the equivalent of the Kraft inequality for rate functions. If a rate function meets them, it is tight in the sense that it cannot be improved uniformly. In some sense, however, they are weaker than the Kraft inequality, since the latter applies to each uniquely decodable fixed-to-variable code, while our conditions apply only to communication systems which attain the error probability individually for each $\mathbf{x}, \mathbf{y}$. In general, when comparing to information theoretic results pertaining to probabilistic channel settings, because the requirements we make are stricter (we require a rate and error probability guarantee per $\mathbf{x}, \mathbf{y}$ rather than on average), our achievability results are stronger, while our necessary conditions (converse) are weaker, since they hold for a restricted class of systems.

Theorems 1 and 2 also bring another observation: any rate function which is achievable (by any system) is also achievable using random coding (the system achieving the sufficient condition), up to a small overhead.



The gap between the upper and lower bounds of Theorems 1 and 2 is equivalent to an overhead of $\log\frac{1}{\epsilon(1-\epsilon)}$ bits over the entire transmission. This overhead is about 20 bits for $\epsilon = 10^{-6}$, so in the scope of working with a fixed but small $\epsilon$ (rather than $\epsilon \to 0$ as $n \to \infty$), the difference between the bounds is small. An analysis of the reasons for this gap can be found in [9]. It is shown that the necessary condition can be reduced by almost one bit at the price of complicating the decoder and the proof, and cannot be further reduced in the current form of the bound. It appears from that analysis that for most rate functions, the required redundancy is close to the one required by the sufficient condition.
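The size of this gap is easy to evaluate. The following sketch (function names are illustrative; per-symbol overheads are in nats, the total gap in bits) computes the two thresholds of Theorem 2 and the total gap over the transmission.

```python
import math

def overhead_bounds(mu, n, eps):
    """Thresholds of Theorem 2 on the overhead delta (nats per symbol).

    Returns (necessary, sufficient): delta below `necessary` cannot work,
    delta at or above `sufficient` always works.
    """
    necessary = mu - math.log(1.0 / (1.0 - eps)) / n
    sufficient = mu + math.log(1.0 / eps) / n
    return necessary, sufficient

def gap_bits(eps):
    """Total gap between the two thresholds over the whole block, in bits."""
    return math.log2(1.0 / (eps * (1.0 - eps)))

n, eps = 1000, 1e-6                       # hypothetical block length and error target
necessary, sufficient = overhead_bounds(0.0, n, eps)
total_gap_bits = n * (sufficient - necessary) / math.log(2)
```

With $\epsilon = 10^{-6}$ the gap evaluates to roughly 19.93 bits, independently of $n$ and of the intrinsic redundancy.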

B. Proof of Theorem 1

1) Necessary condition (converse): In this section we prove the first part of Theorem 1. We need to show that condition (4) holds for achievable and adaptively achievable rate functions. We begin with the case of achievable rate functions (non-adaptive).

Suppose $R_{emp}$ is achievable with $Q, \epsilon$. Consider an encoder and a decoder designed for rate $R$ over block size $n$ and satisfying Definition 4. There are $M \ge \exp(nR)$ input messages. Each input message $m = i \in \{1,\ldots,M\}$ is translated by the encoder into the random sequence $\mathbf{X}_i$, which is a random variable distributed in $\mathcal{X}^n$ (implemented by the common randomness $S$), and is known to the decoder.

According to the requirements of Definition 4, the distribution of $\mathbf{X}_i$ should be $Q(\mathbf{x})$, since the definition requires the input distribution to be $Q(\mathbf{x})$ for any input message. However, for the converse we assume a milder condition: we only assume that the scheme achieves $Q$ on average, i.e. that the input distribution is $Q$ when $i$ is chosen uniformly over $\{1,\ldots,M\}$; in other words:

$$\forall \mathbf{x}: \quad \frac{1}{M}\sum_{i=1}^{M}\Pr(\mathbf{X}_i = \mathbf{x}) = Q(\mathbf{x}) \tag{10}$$

Note that the codewords may be statistically dependent.

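Denoting by $\hat{m}$ the decoded message, then according to Definition 4 we have:

$$\forall \mathbf{y}, i \in \{1,\ldots,M\}: \quad \Pr\{\hat{m} \ne i \mid R_{emp}(\mathbf{X}_i,\mathbf{y}) \ge R\} \le \epsilon \tag{11}$$

Note that the definition implies that (11) holds with respect to the transmitted message. However, since $\hat{m}$ is a function of $\mathbf{y}$ and $S$, for a fixed $\mathbf{y}$ it does not depend on the transmitted message, and therefore, by considering that any of the possible messages may be input to the encoder, and using Definition 4 with respect to this message, we have that (11) holds for any $i$. Therefore the following holds for any $\mathbf{y}$ (where probabilities are over the randomness in the codebook):

$$\begin{aligned}
1 = \sum_{i=1}^{M}\Pr\{\hat{m} = i\} &\ge \sum_i \Pr\{(\hat{m} = i) \cap (R_{emp}(\mathbf{X}_i,\mathbf{y}) \ge R)\} \\
&= \sum_i \Pr\{\hat{m} = i \mid R_{emp}(\mathbf{X}_i,\mathbf{y}) \ge R\}\,\Pr\{R_{emp}(\mathbf{X}_i,\mathbf{y}) \ge R\} \\
&\ge (1-\epsilon)\sum_i \Pr\{R_{emp}(\mathbf{X}_i,\mathbf{y}) \ge R\} = (1-\epsilon)\sum_i \sum_{\mathbf{x}: R_{emp}(\mathbf{x},\mathbf{y}) \ge R}\Pr(\mathbf{X}_i = \mathbf{x}) \\
&= (1-\epsilon)\sum_{\mathbf{x}: R_{emp}(\mathbf{x},\mathbf{y}) \ge R}\sum_i \Pr(\mathbf{X}_i = \mathbf{x}) = (1-\epsilon)\sum_{\mathbf{x}: R_{emp}(\mathbf{x},\mathbf{y}) \ge R} M\,Q(\mathbf{x}) \\
&= (1-\epsilon)\,M\,Q\{R_{emp}(\mathbf{X},\mathbf{y}) \ge R\}
\end{aligned} \tag{12}$$

where the second inequality uses (11) and the next-to-last equality uses (10). Therefore

$$Q\{R_{emp}(\mathbf{X},\mathbf{y}) \ge R\} \le \frac{1}{(1-\epsilon)M} \le \frac{1}{1-\epsilon}\exp(-nR) \tag{13}$$

This holds for any $\mathbf{y}$. In addition, Definition 4 requires that such a system exist for any $R$, therefore (13) holds for any $R$ as well. This proves the claim for the case of achievable $R_{emp}$.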

The case of adaptively achievable $R_{emp}$ follows from the same argument. First, one may convert the adaptive rate system with feedback into a non-adaptive rate system with feedback: fix a rate $R$ and let the decoder output only $nR$ bits, and an error if the rate is $R_{emp} < R$. Therefore whenever $R_{emp} \ge R$, with probability $1 - \epsilon$ the message will be decoded correctly. Now, note that (12) refers to any fixed value of $\mathbf{y}$. Therefore (12) holds even if the encoder knows the value of $\mathbf{y}$, and in particular it holds also in the presence of feedback (partial and sequential knowledge of $\mathbf{y}$). Hence the result holds also for $R_{emp}$ which is adaptively achievable.

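2) Sufficient condition (direct): The direct side is shown by generating the $M = \lceil\exp(nR)\rceil$ codewords $\mathbf{X}_i$ i.i.d. with distribution $Q$. Thus, the condition on the input distribution is met. The decoder, after observing $\mathbf{y}$, chooses $\hat{m}$ to be the index of the word with the maximum value of $R_{emp}(\mathbf{X}_i,\mathbf{y})$ (breaking ties arbitrarily), i.e.

$$\hat{m} = \operatorname*{argmax}_i \left[R_{emp}(\mathbf{X}_i,\mathbf{y})\right] \tag{14}$$

We assume a given message $m$, and a given $\mathbf{X}_m = \mathbf{x}$. Since the codewords are independent, conditioning on $\mathbf{x}$ does not change the distribution of the other codewords. By the union bound, the probability of error is bounded by:

$$\begin{aligned}
P_e^{(m)}(\mathbf{x},\mathbf{y}) &\le \Pr\left\{\bigcup_{i \ne m}\left(R_{emp}(\mathbf{X}_i,\mathbf{y}) \ge R_{emp}(\mathbf{X}_m,\mathbf{y})\right)\right\} \\
&\le (M-1)\cdot Q\{R_{emp}(\mathbf{X},\mathbf{y}) \ge R_{emp}(\mathbf{x},\mathbf{y})\} \\
&\le (M-1)\cdot\epsilon\cdot\exp(-nR_{emp}(\mathbf{x},\mathbf{y})) \\
&\le \epsilon\cdot\exp[n(R - R_{emp}(\mathbf{x},\mathbf{y}))]
\end{aligned} \tag{15}$$

where in the last inequality we substituted $M \le \exp(nR) + 1$. Therefore if $R_{emp}(\mathbf{x},\mathbf{y}) \ge R$, we will have $P_e^{(m)}(\mathbf{x},\mathbf{y}) \le \epsilon$ as required. $\square$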
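The decoder (14) and the first two steps of (15) can be sketched for a toy alphabet as follows. The scoring function `agreement_rate` is a hypothetical stand-in for a rate function (it is not claimed to satisfy (5)); the helper names and the uniform prior are likewise assumptions of this example. For an i.i.d. codebook, the probability that some competing codeword ties or beats the transmitted one is $1 - (1-p)^{M-1}$ with $p = Q\{R_{emp}(\mathbf{X},\mathbf{y}) \ge R_{emp}(\mathbf{x},\mathbf{y})\}$, which the union bound relaxes to $(M-1)p$.

```python
from itertools import product

def agreement_rate(x, y):
    """Hypothetical score: fraction of positions where x and y agree."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

def decode(codebook, y, rate_fn):
    """Decoder (14): index of the codeword with the largest score (first on ties)."""
    return max(range(len(codebook)), key=lambda i: rate_fn(codebook[i], y))

def error_bounds(prior, rate_fn, x, y, num_codewords):
    """Probability that a competitor ties or beats x, and its union bound."""
    threshold = rate_fn(x, y)
    p = sum(q for xp, q in prior.items() if rate_fn(xp, y) >= threshold)
    tie_or_beat = 1.0 - (1.0 - p) ** (num_codewords - 1)
    union = min(1.0, (num_codewords - 1) * p)
    return tie_or_beat, union

n = 4
prior = {x: 2.0 ** -n for x in product((0, 1), repeat=n)}   # uniform Q
codebook = [(0, 0, 0, 0), (1, 1, 1, 1), (0, 1, 0, 1)]       # one fixed draw
y = (1, 1, 1, 0)
decoded = decode(codebook, y, agreement_rate)
tie_or_beat, union = error_bounds(prior, agreement_rate, codebook[1], y, len(codebook))
```

Here the second codeword agrees with $\mathbf{y}$ in three of four positions and is selected; five of the sixteen sequences score at least as high, so $p = 5/16$.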
C. Comments on the proof of Theorem 1

• To understand the proof of the necessary condition, it is useful to think that the channel output $\mathbf{y}$ is set to a constant. Thus, the decoder is isolated from the encoder, and is required to decide on the message $m$ based solely on its knowledge of the codebook.

• The proof of the theorem teaches something about the way rate functions are achieved: conditioning on $\mathbf{x}$ and $\mathbf{y}$, the different codebooks generated all include $\mathbf{x}$, and in addition other codewords. If $R_{emp}(\mathbf{x},\mathbf{y})$ is large, then in most codebooks the other codewords will have a smaller value of $R_{emp}$, due to the constraint on its distribution. Therefore, by choosing the word with the maximum $R_{emp}$, the decoder would usually be correct. The necessary condition means that this is actually required to happen in order for $R_{emp}$ to be achievable, as the decoder is "isolated" from the encoder and still committed to (11): if there are several words with $R_{emp}(\mathbf{X}_i,\mathbf{y}) \ge R$, the decoder will need to toss a coin and split the distribution in some way between them, with a large probability of being in error. The analysis of the gap between the necessary and sufficient conditions in [9] sheds more light on this topic.

• By the current definitions, it is assumed that the input distribution $Q$ does not depend on $\mathbf{y}$. However, note that since the proofs of the necessary and sufficient conditions both consider a fixed value of $\mathbf{y}$, the results hold, under a suitable formulation, also for the case where the input distribution depends on $\mathbf{y}$.

• We can adopt two points of view when considering systems satisfying Theorem 1 (the achievability of rate functions): one is as communication systems trying to convey messages over an unknown channel; another is a cynical perspective in which we do not assume the input and output are related (and thus it is impossible to convey information), but we are only trying to design systems that satisfy the promises of the theorems, and the question is viewed as a game between the encoder and decoder on one side, and the environment choosing $\mathbf{y}$ and the message on the other. The first point of view gives us the motivation and application of the theorems; the second is more suitable for the design and analysis. This is similar to the case of prediction and learning with expert advice [10], [11]: when designing these learning algorithms the assumption is that the information supplied by the experts is completely arbitrary, and therefore the target is not to "learn" but just to compete; but the application of the results is for learning (where we assume there is some information at least in some of the experts' advice).

D. Examples 

Example 1 (A wire). Consider the binary input, binary output channel $\mathcal{X} = \mathcal{Y} = \{0, 1\}$ with the rate function $R_{emp} = \mathrm{Ind}(\mathbf{x} = \mathbf{y})$, i.e. $R_{emp} = 1$ iff the output is identical to the input. This function is easily achievable, with $Q(\mathbf{x}) = \mathbb{U}(\mathcal{X}^n)$. To attain this rate function without error ($\epsilon = 0$), one simply transmits the message uncoded, at a rate $R = 1$. If the channel output happened to equal the input, the communication succeeded. If it happened to be different, $R_{emp} = 0 < R$ and thus no guarantee was made. $Q(\mathbf{x})$ needs to be uniform in order to achieve rate 1. For this rate function and any $0 < R \le 1$, the condition $R_{emp} \ge R$ is satisfied by one sequence, and therefore $Q\{R_{emp} \ge R\} = 2^{-n}$. This satisfies the necessary condition in Theorem 1 with equality for $\epsilon = 0$ (at $R = 1$), and thus the sufficient condition is not tight here.

Note that the codebook that achieves this rate function is not a random i.i.d. codebook: the codewords are fixed, or, in order to achieve the input distribution condition, should be generated by randomly permuting the $2^n$ possible sequences. Therefore the codewords are correlated, which is necessary in order to attain the necessary condition. Furthermore, the regions of $\mathbf{x} \in \mathcal{X}^n$ for which $R_{emp}(\mathbf{x},\mathbf{y}) > 0$ obtained for different values of $\mathbf{y}$ are disjoint, in which case, as we have noted, the necessary condition could be tight. If we had insisted on generating the codewords independently, then this rate function could not be achieved without some loss, due to the probability of two codewords being equal; therefore in that case the maximum rate would be closer to the rate determined by the sufficient condition.

Example 2 (A fixed codebook). Similarly, consider transmission using a fixed codebook of $M = \exp(nR_0)$ codewords, and an arbitrary fixed decoder. We may randomly permute the messages in order to guarantee a fixed input distribution for any message. In this case $Q(\mathbf{x}) = \frac{1}{M}$ when $\mathbf{x}$ is in the codebook and 0 otherwise. Define the rate function $R_{emp}(\mathbf{x},\mathbf{y})$ as $R_0$ if $\mathbf{y}$ is decoded by the decoder to the message represented by $\mathbf{x}$, and 0 otherwise. Then for $0 < R \le R_0$, $Q\{R_{emp} \ge R\} = \frac{1}{M} = \exp(-nR_0) \le \exp(-nR)$, and as before the necessary condition is satisfied with equality for $\epsilon = 0$.

Example 3 (The empirical mutual information). Lemma 1 in our previous paper [1] states that for any i.i.d. prior $Q$, $Q\{\hat{I}(\mathbf{x};\mathbf{y}) \ge R\} \le \exp(-n(R - \delta_n))$ with $\delta_n = |\mathcal{X}||\mathcal{Y}|\frac{\log(n+1)}{n} \to 0$. Therefore $\mu_Q(\hat{I}) \le \delta_n$, and the conclusion from Theorem 2 is that this function is achievable up to $\delta_n + \frac{1}{n}\log\frac{1}{\epsilon}$. Note that the actual intrinsic redundancy is about half of this bound (see Section VIII-A).

Example 4 (A second order rate function). The rate function $R_{emp} = \frac{1}{2}\log\frac{1}{1-\hat{\rho}^2}$ presented in the previous paper [1] has an intrinsic redundancy $\mu_Q(R_{emp}) = \infty$. This results from the factor $n - 1$ instead of $n$ in Lemma 4 there, which causes $-\frac{1}{n}\log\Pr(R_{emp} \ge R)$ to grow slower than $R$ for large values of $R$. The implication is that this rate function cannot be attained with a fixed loss, but the loss must grow with $R$. So for example one cannot attain $R_{emp} - \delta$, but one can attain $\gamma \cdot R_{emp}$ (with $\gamma \to 1$ as $n \to \infty$). The proof is technical and is deferred to Appendix E.
E. General systems and Good-put functions

The requirement to attain a fixed error probability for every $\mathbf{x}, \mathbf{y}$ releases the characterization of the communication system from dependence on the channel. On the other hand, it may seem to be an over-requirement, since from an application perspective requiring a low average error probability may be sufficient. In this section it is shown that this over-requirement is not as strong as it may seem: any communication system may be converted to a system guaranteeing a small error probability, with a small price in the rate.¹ This result holds in full generality only for the non-adaptive case; however, considering the subset of adaptively achievable rate functions presented in Section VII, it makes sense to believe that for many systems of interest this will hold also adaptively. Thus, the concept of attainable rate functions is not as esoteric as it would initially seem.

Let us consider a system delivering a rate $R_{sys}$, with an error probability $\epsilon_{sys}$. This system may be quite general. To fix thoughts, it may be useful to consider the two examples of a practical (Turbo/LDPC) encoder and decoder, perhaps combined within a more complex system involving channel estimation, feedback, scrambling, etc., and on the other hand, a theoretical random coding system. Each system generates a certain input distribution $Q(\mathbf{x}) = Q_{sys}(\mathbf{x})$, which is assumed to be independent of the channel output.

In order to characterize the system with a single number, consider the rate of error-free bits delivered by the system, sometimes referred to as "good-put" (in contrast to throughput):

$$R_{good} = (1 - \epsilon_{sys})R_{sys}. \tag{16}$$

This value is a little optimistic, because it ignores the need to detect the errors. As an example, delivering one bit per second with error probability one half is not the equivalent of half a bit per second. This additional gap is related to the factor $h_b(\epsilon_{sys})$ in Fano's inequality, and is asymptotically negligible. Now, assuming that $\epsilon_{sys}$ and $R_{sys}$ are not fixed but may change (depending, e.g., on the channel or on common randomness), the good-put is the average of the above, i.e.

$$R_{good} = \mathbb{E}\left[(1 - \epsilon_{sys})R_{sys}\right]. \tag{17}$$

To obtain a characterization of a system which is independent of the channel, the above may be conditioned on the channel input and output $\mathbf{x}, \mathbf{y}$. Define

$$R_{good}(\mathbf{x},\mathbf{y}) = \mathbb{E}\left[(1 - \epsilon_{sys})R_{sys} \mid \mathbf{x},\mathbf{y}\right]. \tag{18}$$

In other words, $R_{good}(\mathbf{x},\mathbf{y})$ is the average good-put obtained with the system when the input and output happened to be $\mathbf{x}, \mathbf{y}$. For a deterministic block encoder/decoder, the conditional error probability is either 0 or 1, and the good-put is, respectively, either $R_{sys}$ or 0. The function $R_{good}(\mathbf{x},\mathbf{y})$ is only a function of the system and not of the channel, and when a specific probabilistic channel is known, the average good-put may be computed as $\bar{R}_{good} = \mathbb{E}\left[R_{good}(\mathbf{X},\mathbf{Y})\right]$.

Next, let us show that for any system, $R_{good}(\mathbf{x},\mathbf{y})$ is an asymptotically achievable rate function (with the prior $Q(\mathbf{x}) = Q_{sys}(\mathbf{x})$). Initially, it is assumed that $R_{sys}$ is a constant, i.e. the system delivers a constant rate, with a varying error probability $\epsilon_{sys}(\mathbf{x},\mathbf{y})$. Assume the message $m$ is a uniform random variable over $\{1,\ldots,M\}$, $M = \exp(nR_{sys})$. The system is defined by common randomness $S$ (possibly), a transmission function $\mathbf{X}(S, m)$ and a decoding function $\hat{m}(S,\mathbf{y})$ (see Definition 3). Now, consider the system's operation when $\mathbf{y}$ is set to a constant. Any feedback the system might have can be ignored, as it conveys constant information. In this case, $\hat{m}(S,\mathbf{y})$ and $m$ are independent, and:

$$\Pr\{\hat{m}(S,\mathbf{y}) = m\} = \frac{1}{M} = \exp(-nR_{sys}). \tag{19}$$

The error probability is

$$\epsilon_{sys}(\mathbf{x},\mathbf{y}) = \Pr\{\hat{m}(S,\mathbf{y}) \ne m \mid \mathbf{X}(S,m) = \mathbf{x}\}. \tag{20}$$

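Now,

$$\begin{aligned}
\exp(-nR_{sys}) &= \Pr\{\hat{m}(S,\mathbf{y}) = m\} \\
&= \sum_{\mathbf{x}}\Pr\{\hat{m}(S,\mathbf{y}) = m \cap \mathbf{X}(S,m) = \mathbf{x}\} \\
&= \sum_{\mathbf{x}}\Pr\{\hat{m}(S,\mathbf{y}) = m \mid \mathbf{X}(S,m) = \mathbf{x}\}\cdot\Pr\{\mathbf{X}(S,m) = \mathbf{x}\} \\
&= \sum_{\mathbf{x}}\left(1 - \epsilon_{sys}(\mathbf{x},\mathbf{y})\right)Q(\mathbf{x}) = \sum_{\mathbf{x}}\frac{R_{good}(\mathbf{x},\mathbf{y})}{R_{sys}}Q(\mathbf{x})
\end{aligned} \tag{21}$$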
¹ Practically, the latter system may be more complex to implement.
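For any $0 < R \le R_{sys}$, the sum above is bounded by:

$$\begin{aligned}
\sum_{\mathbf{x}}\frac{R_{good}(\mathbf{x},\mathbf{y})}{R_{sys}}Q(\mathbf{x}) &\ge \sum_{\mathbf{x}: R_{good}(\mathbf{x},\mathbf{y}) \ge R}\frac{R_{good}(\mathbf{x},\mathbf{y})}{R_{sys}}Q(\mathbf{x}) \\
&\ge \frac{R}{R_{sys}}\sum_{\mathbf{x}: R_{good}(\mathbf{x},\mathbf{y}) \ge R}Q(\mathbf{x}) \\
&= \frac{R}{R_{sys}}\Pr\{R_{good}(\mathbf{X},\mathbf{y}) \ge R\}
\end{aligned} \tag{22}$$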



Combining (21) and (22) yields:

$$\Pr\{R_{good}(\mathbf{X},\mathbf{y}) \ge R\} \le \frac{R_{sys}}{R}\exp(-nR_{sys}). \tag{23}$$

For $x \ge 1$ the function $xe^{-x}$ is decreasing. Substituting $\log(e)\,x = nR$ yields that $R\exp(-nR)$ is decreasing with $R$ for $R \ge \frac{\log e}{n}$, and therefore $\frac{R_{sys}}{R}\exp(-nR_{sys}) \le \exp(-nR)$. For $R < \frac{\log e}{n}$ (where $\exp(-nR) > e^{-1}$), the probability in (23) can be simply upper bounded by 1. This yields the following simple bound, where $e$ denotes the base of the natural logarithm:

$$\Pr\{R_{good}(\mathbf{X},\mathbf{y}) \ge R\} \le e\cdot\exp(-nR). \tag{24}$$

For the case of $R > R_{sys}$, the above holds trivially. The bound above corresponds to the sufficient condition of Theorem 1 with an intrinsic redundancy of $\mu_Q(R_{good}) \le \frac{\log e}{n}$, and $R_{good}$ is therefore asymptotically achievable (Theorem 2). Notice that the system achieving this rate (Section IV-B2) is potentially very different from the original system. Furthermore, the bound leading from (23) to (24) is very coarse, which implies the good-put is a very pessimistic bound on the rate that can be achieved. This is because the error probability can be exponentially improved with a decrease in the rate, while in the good-put function there is only a linear decrease (e.g. the error probability when attaining $R_{good} = \frac{1}{2}R_{sys}$ is $\frac{1}{2}$ with the original system, whereas it could have been significantly better). The extension to rate adaptive systems appears in Appendix B. This is summarized by the following theorem:



Theorem 3. The good-put function (18) of any fixed-rate or adaptive rate system (Definitions 3, 5), possibly including common randomness and feedback, is an asymptotically achievable rate function, with the prior generated by the system's codebook distribution, and has an intrinsic redundancy of $\mu_Q(R_{good}) \le \frac{\log e}{n}$.
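The step from (23) to (24) can be checked numerically. The sketch below works in nats (so that $\log e = 1$); the function names and the parameter values are chosen only for illustration.

```python
import math

def goodput_ccdf_bound(rate, r_sys, n):
    """Bound (23) on Pr{R_good >= R} for 0 < R <= r_sys, capped at 1 (nats)."""
    return min(1.0, (r_sys / rate) * math.exp(-n * r_sys))

def simple_bound(rate, n):
    """Bound (24): e * exp(-n R), with e the base of the natural logarithm."""
    return math.e * math.exp(-n * rate)

n, r_sys = 50, 0.5                                   # hypothetical block length and system rate
grid = [r_sys * k / 200 for k in range(1, 201)]      # 0 < R <= r_sys
pairs = [(goodput_ccdf_bound(r, r_sys, n), simple_bound(r, n)) for r in grid]
slack = [b24 / b23 for b23, b24 in pairs]            # how loose (24) is relative to (23)
```

The ratio in `slack` is at least 1 on the whole grid and becomes very large at intermediate rates, which illustrates how coarse the passage from (23) to (24) is.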

An interesting and insightful result of the combination of Theorem 3 and Theorem 5, which is proven in Section V, is that the rate of any system can be characterized by two probability functions $P(\mathbf{x}|\mathbf{y})$ and $Q(\mathbf{x})$ (where the second is the input distribution).

If, furthermore, this achievable rate function satisfies the structure defined in Section VII, then it is also asymptotically adaptively achievable, i.e. there exists a system attaining the same rates, but with an error probability as small as desired, for any pair of sequences.

V. An asymptotical characterization of achievable rate functions 

In Theorem 1 we have shown that achievable rate functions have a CCDF upper bounded by a decaying exponential function. Therefore it stands to reason that the Chernoff bound for the probability $Q(R_{emp}(\mathbf{X},\mathbf{y}) \ge R)$ may be rather tight. From this observation we derive asymptotic necessary and sufficient conditions which are easier to calculate. The main result of this section is that asymptotically achievable rate functions are bounded by the form $\frac{1}{n}\log\frac{f(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})}$ for some conditional probability assignment $f(\mathbf{x}|\mathbf{y})$. As a result this form can be used as a prototype for rate functions.

A. The Chernoff and Markov inequalities 

The Chernoff and Markov inequalities are useful tools in the following analysis. The Markov inequality simply states that for any non-negative random variable $A$ and any $t > 0$,

$$\Pr\{A \ge t\} \le \frac{\mathbb{E}[A]}{t} \tag{25}$$

The proof is simple, by applying the expected value operator to both sides of $\mathrm{Ind}(A \ge t) \le \frac{A}{t}$. From this simple bound, many useful bounds can be derived; for example the Chebyshev inequality is obtained by substituting $A = (X - \mathbb{E}[X])^2$. The Chernoff upper bound for $\Pr(X \ge \tau)$ is obtained by substituting $A = \exp(\beta X)$, $t = \exp(\beta\tau)$ for some constant $\beta > 0$, and then optimizing over $\beta$. The main strength of the Chernoff bound results from the fact that when $X$ is a sum of independent random variables $X = \sum_i X_i$, then $\mathbb{E}[A] = \mathbb{E}[\exp(\beta X)] = \mathbb{E}\left[\prod_i \exp(\beta X_i)\right] = \prod_i \mathbb{E}[\exp(\beta X_i)]$ breaks into a product of terms associated with each individual element, which is in most cases simpler to calculate. Since information theoretic values are associated with log-probabilities, the Markov and Chernoff bounds are virtually the same in our context (the Chernoff bound when applied to the log-probabilities is equivalent to the Markov inequality applied to the probabilities).
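For concreteness, the following sketch compares the exact tail of a binomial variable (a sum of independent Bernoulli terms) with the Markov bound (25) and with the Chernoff bound optimized over a grid of $\beta$ values. The example, its parameters and the function names are illustrative assumptions only; the product form $\mathbb{E}[\exp(\beta X)] = (1 - p + pe^{\beta})^n$ is the factorization described above.

```python
import math

def binomial_tail(n, p, k):
    """Exact Pr{X >= k} for X ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

def markov_bound(n, p, k):
    """Markov inequality (25) applied directly to X: E[X] / k."""
    return min(1.0, n * p / k)

def chernoff_bound(n, p, k, betas):
    """Markov applied to exp(beta X), minimized over the supplied beta > 0 grid."""
    return min((1 - p + p * math.exp(b)) ** n * math.exp(-b * k) for b in betas)

n, p, k = 100, 0.5, 70                       # hypothetical parameters
betas = [0.01 * i for i in range(1, 301)]    # coarse grid instead of exact optimization
exact = binomial_tail(n, p, k)
markov = markov_bound(n, p, k)
chernoff = chernoff_bound(n, p, k, betas)
```

Every $\beta > 0$ yields a valid bound, so the grid minimum is still an upper bound on the tail even though it is not the exact optimum.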
B. Application of the Chernoff bound

Consider a sequence of rate functions $R_{emp}(\mathbf{x}^n,\mathbf{y}^n)$ for $n = 1, 2, \ldots$. We would like to find out whether $R_{emp}$ is asymptotically attainable. Although $R_{emp}$ may be asymptotically attainable, the intrinsic redundancy associated with it may not tend to zero. In other words, it may be possible to attain $F_n(R_{emp}(\mathbf{x}^n,\mathbf{y}^n))$ (with $F_n(t) \to t$ as $n \to \infty$), but $F_n$ is not necessarily of the form $F_n(t) = t - \delta_n$ with $\delta_n \to 0$. Therefore it is useful to consider more general functions $F_n(t)$. As an example of such a case see the rate function for the continuous MIMO channel presented in Section VIII-F, which is achieved up to $F_n(t) = \gamma_n t - \delta_n$.

We consider the rate function $F_n(R_{emp}(\mathbf{x},\mathbf{y}))$. Using the Chernoff/Markov inequality to bound the probabilities in Theorem 1 we have:

$$\begin{aligned}
Q\{F_n(R_{emp}(\mathbf{X},\mathbf{y})) \ge R\} &= Q\{\exp(nF_n(R_{emp}(\mathbf{X},\mathbf{y}))) \ge \exp(nR)\} \\
&\le \mathbb{E}\left[\exp(nF_n[R_{emp}(\mathbf{X},\mathbf{y})])\right]\exp(-nR) = L_{F,n}\cdot\exp(-nR)
\end{aligned} \tag{26}$$

where

$$L_{F,n} = \mathbb{E}_Q\left[\exp(nF_n[R_{emp}(\mathbf{X},\mathbf{y})])\right] \tag{27}$$

In many cases, for a suitable choice of $F$, such as $F_n = \gamma t$, calculating $L_{F,n}$ is simpler than calculating the probability $Q\{R_{emp}(\mathbf{X},\mathbf{y}) \ge R\}$. From this bound and definition (6) we have that the intrinsic redundancy of $F_n[R_{emp}]$ satisfies

$$\mu_Q(F_n[R_{emp}]) \le \frac{1}{n}\log L_{F,n} \tag{28}$$

By Theorem 2 this implies that $F_n[R_{emp}]$ is achievable up to $\delta_n = \frac{1}{n}\log L_{F,n} + \frac{1}{n}\log\frac{1}{\epsilon}$. If for some sequence $F_n(t) \to t$ (as $n \to \infty$) we have $\frac{1}{n}\log L_{F,n} \to 0$ (in other words, $L_{F,n}$ increases subexponentially with $n$), this implies that $F_n[R_{emp}] - \delta_n$ is achievable where $\delta_n \to 0$, and therefore $R_{emp}$ is asymptotically achievable. On the other hand, as we show below, this condition is also necessary. This manifests the claim that the use of the Chernoff bound is asymptotically tight.
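A small numerical check of (27) and (28) is sketched below for a finite alphabet. The list layout, the strictly positive prior, the randomly drawn rate values and the particular shaping function $F_n(t) = 0.9t - 0.01$ are all assumptions of this example; logarithms are natural.

```python
import math
import random

def intrinsic_redundancy(rate, prior, n):
    """Evaluate (6) for rate[y][x] and prior[x] > 0 (natural logs)."""
    return max(
        math.log(sum(q for q, r2 in zip(prior, row) if r2 >= r)) / n + r
        for row in rate
        for r in row
    )

def chernoff_constant(row, prior, n, shaping):
    """L_{F,n} of (27) for one fixed y: E_Q[exp(n F_n(R_emp(X, y)))]."""
    return sum(q * math.exp(n * shaping(r)) for q, r in zip(prior, row))

def shaping(t):
    """Hypothetical F_n(t) of the form gamma_n * t - delta_n."""
    return 0.9 * t - 0.01

rng = random.Random(1)
n, num_x, num_y = 5, 32, 4
prior = [1.0 / num_x] * num_x
rate = [[rng.uniform(0.0, 1.0) for _ in range(num_x)] for _ in range(num_y)]

shaped = [[shaping(r) for r in row] for row in rate]
mu_shaped = intrinsic_redundancy(shaped, prior, n)
# Right-hand side of (28), taking the worst case over y.
bound = max(math.log(chernoff_constant(row, prior, n, shaping)) / n for row in rate)
```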

C. Asymptotic tightness of the Chernoff bound 

Theorem 4. A sequence of rate functions $R_{emp}(\mathbf{x}^n,\mathbf{y}^n)$ is asymptotically achievable with a sequence of priors $Q(\mathbf{x}^n)$, iff there exists a sequence of functions $F_n(t) \to t$ (as $n \to \infty$), such that for all $\mathbf{y}$:

$$\limsup_{n \to \infty}\frac{1}{n}\log L_{F,n} \le 0 \tag{29}$$

where $L_{F,n}$ is defined in (27).

Note that, compared with the conditions of Theorem 1, which are conditions on the CCDF of $R_{emp}(\mathbf{X},\mathbf{y})$ and must be satisfied per $R$, the condition above is a simpler condition on an expected value, which does not explicitly refer to the rate $R$. Let us begin with the following lemma, which is the heart of the reverse part.

Lemma 1. Any achievable rate function $R_{emp}$ (with $\epsilon, Q$) satisfies for $\gamma < 1$:

$$\forall \mathbf{y}: \quad \mathbb{E}_Q\left[\exp(n\gamma R_{emp}(\mathbf{X},\mathbf{y}))\right] \le \frac{1}{(1-\epsilon)^{\gamma}(1-\gamma)} \tag{30}$$

Proof: Suppose that $R_{emp}$ is achievable; by Theorem 1 this implies

$$\forall \mathbf{y} \in \mathcal{Y}^n, R \in \mathbb{R}: \quad Q\{R_{emp}(\mathbf{X},\mathbf{y}) \ge R\} \le \frac{1}{1-\epsilon}\exp(-nR) \tag{31}$$

Intuitively it is clear that this constraint on the CCDF of $R_{emp}$ implies that the exponential factor in (30) is canceled out by the exponential decay of the distribution. For a fixed $\mathbf{y}$, define the random variable $V = \exp(-nR_{emp}(\mathbf{X},\mathbf{y}))$ and substitute $r = \exp(-nR)$. Then the above can be written as a condition on the CDF of $V$, $F_V(r)$:

$$\begin{aligned}
\forall r > 0: \quad F_V(r) = \Pr(V \le r) &= Q\{\exp(-nR_{emp}(\mathbf{X},\mathbf{y})) \le \exp(-nR)\} \\
&= Q\{R_{emp}(\mathbf{X},\mathbf{y}) \ge R\} \le \frac{1}{1-\epsilon}\exp(-nR) = \frac{r}{1-\epsilon}
\end{aligned} \tag{32}$$

Next, this condition on the CDF is translated to a conclusion on the expected value. Since by definition $F_V(r) \in [0,1]$ we can write the bound as $F_V(r) \le F_U(r) = \min\left(\frac{r}{1-\epsilon}, 1\right)$, i.e. $F_V(r)$ is bounded by the CDF of a uniform random variable $U \sim \mathbb{U}[0, 1-\epsilon]$. This implies that we can bound $V \ge U$, as formulated in the following lemma:
Lemma 2 (CDF inequality). Let $V$ be a random variable and let the cumulative distribution function of $V$ be bounded by $F_V(x) \le F_U(x)$, where $F_U(x)$ is a cumulative distribution function and is monotonically increasing for all $x$ such that $0 < F_U(x) < 1$. Then there exists a random variable $U \sim F_U$ such that $V \ge U$.

Proof: Since $F_U(x)$ is monotonically increasing it is invertible for values in the region $(0, 1)$. Let $U = F_U^{-1}(F_V(V))$. Then by the well known inverse transform theorem $F_V(V)$ is uniform $\mathbb{U}[0, 1]$ and therefore by applying $F_U^{-1}$ we obtain that $U$ is distributed according to $F_U$. Since $F_U$ is monotonically increasing, so is its inverse. Thus by applying $F_U^{-1}$ to both sides of the inequality $F_V(V) \le F_U(V)$ we obtain $U \le V$. $\square$
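The coupling used in the proof can be illustrated by simulation. In the sketch below the distribution of $V$ (with $F_V(r) = r^2$ on $[0,1]$, which lies below $F_U(r) = \min(r/(1-\epsilon), 1)$), the value of $\epsilon$ and the sample size are hypothetical choices made for this example.

```python
import math
import random

eps = 0.2                                  # hypothetical error probability

def cdf_v(r):
    """CDF of V = sqrt(W), W uniform on [0, 1]: F_V(r) = r^2 on [0, 1]."""
    return min(max(r, 0.0), 1.0) ** 2

def inv_cdf_u(prob):
    """Inverse CDF of U ~ Uniform[0, 1 - eps]."""
    return (1.0 - eps) * prob

rng = random.Random(0)
samples_v = [math.sqrt(rng.random()) for _ in range(20000)]
# Coupled variable U = F_U^{-1}(F_V(V)), built from the same randomness as V.
samples_u = [inv_cdf_u(cdf_v(v)) for v in samples_v]
mean_u = sum(samples_u) / len(samples_u)
```

Each coupled pair satisfies $U \le V$, and the empirical mean of $U$ is close to $(1-\epsilon)/2$, as expected for a uniform variable on $[0, 1-\epsilon]$.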

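Returning to the proof of Lemma 1, let $U \sim \mathbb{U}[0, 1-\epsilon]$ be a random variable that satisfies $U \le V$; then

$$\begin{aligned}
\mathbb{E}\left[\exp(n\gamma R_{emp}(\mathbf{X},\mathbf{y}))\right] &= \mathbb{E}\left[V^{-\gamma}\right] \le \mathbb{E}\left[U^{-\gamma}\right] = \int_0^{1-\epsilon}\frac{u^{-\gamma}}{1-\epsilon}\,du \\
&= \frac{1}{1-\epsilon}\left[\frac{u^{1-\gamma}}{1-\gamma}\right]_0^{1-\epsilon} = \frac{(1-\epsilon)^{1-\gamma}}{(1-\epsilon)(1-\gamma)} = \frac{1}{(1-\epsilon)^{\gamma}(1-\gamma)}
\end{aligned} \tag{33}$$

The condition $\gamma < 1$ is required for the integral to exist. $\square$

Notice that it is possible to prove the result by using integration by parts; however, the current proof technique avoids any continuity/integrability assumptions.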
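Proof of Theorem 4:

Direct part: if (29) holds for some sequence $F_n(t)$, then there exists an upper bounding sequence $\delta_n \to 0$ (as $n \to \infty$) such that $\frac{1}{n}\log L_{F,n} \le \delta_n$; therefore by Theorem 2 and (28), we have that $F_n[R_{emp}]$ is achievable up to

$$\mu_Q(F_n(R_{emp})) + \frac{1}{n}\log\frac{1}{\epsilon} \le \frac{1}{n}\log L_{F,n} + \frac{1}{n}\log\frac{1}{\epsilon} \le \delta_n + \frac{1}{n}\log\frac{1}{\epsilon} \tag{34}$$

Therefore defining $G_n(t) = F_n(t) - \left(\delta_n + \frac{1}{n}\log\frac{1}{\epsilon}\right)$, we have that $G_n(t) \to t$, and $G_n(R_{emp}) = F_n(R_{emp}) - \left(\delta_n + \frac{1}{n}\log\frac{1}{\epsilon}\right)$ is achievable, and therefore by definition $R_{emp}$ is asymptotically achievable.

Reverse part: Suppose that $R_{emp}$ is asymptotically achievable. Then by definition for any $\epsilon$, there exists a sequence of functions $F_n(t) \to t$ such that $F_n[R_{emp}]$ is achievable. By Lemma 1 this implies (for $\gamma_n < 1$):

$$\mathbb{E}\left[\exp(n\gamma_n F_n[R_{emp}(\mathbf{X},\mathbf{y})])\right] \le \frac{1}{(1-\epsilon)^{\gamma_n}(1-\gamma_n)}. \tag{35}$$

Defining $G_n(t) = \gamma_n \cdot F_n(t)$, then by definition (27) the LHS equals $L_{G,n}$. Choosing $\gamma_n = 1 - \frac{1}{n}$ we have that $G_n(t) \to t$, while

$$L_{G,n} = \mathbb{E}\left[\exp(nG_n[R_{emp}(\mathbf{X},\mathbf{y})])\right] \le \frac{n}{(1-\epsilon)^{1-\frac{1}{n}}}, \tag{36}$$

and therefore

$$\limsup_{n \to \infty}\frac{1}{n}\log L_{G,n} \le \limsup_{n \to \infty}\frac{1}{n}\log\frac{n}{(1-\epsilon)^{1-\frac{1}{n}}} = \lim_{n \to \infty}\frac{1}{n}\log\frac{n}{1-\epsilon} = 0 \tag{37}$$

which satisfies (29). $\square$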

D. Conditional probabilities and rate functions

We now apply Theorem 4 to obtain a more intuitive form for the asymptotic rate functions. We assume that the conditions of Theorem 4 hold. For the sake of discussion, let us for the moment replace the limits with equalities, i.e. assume that $\frac{1}{n}\log L_{F,n} = 0$ (i.e. $L_{F,n} = 1$) and $F_n(t) = t$. Then by definition (27) we have:

$$L_{F,n} = \mathbb{E}\left[\exp(nR_{emp}(\mathbf{X},\mathbf{y}))\right] = \sum_{\mathbf{x}}Q(\mathbf{x})\exp(nR_{emp}(\mathbf{x},\mathbf{y})) = 1 \tag{38}$$

Denote the summand:

$$f(\mathbf{x}|\mathbf{y}) = Q(\mathbf{x})\exp(nR_{emp}(\mathbf{x},\mathbf{y})) \tag{39}$$
Then (38) implies $\sum_{\mathbf{x}} f(\mathbf{x}|\mathbf{y}) = 1$ for every $\mathbf{y}$. Therefore $f(\mathbf{x}|\mathbf{y})$ is a legitimate conditional distribution on $\mathbf{x}$. By inverting the relation (39), $R_{emp}$ is written as:

$$R_{emp}(\mathbf{x},\mathbf{y}) = \frac{1}{n}\log\frac{f(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})} \tag{40}$$

The considerations above remain the same for continuous input, by replacing the sum with an integral. Note that this rate function is not defined for $\mathbf{x}$ with $Q(\mathbf{x}) = 0$; however, by the definitions of achievability, the values of $R_{emp}$ for such $\mathbf{x}$ have no consequence, and therefore we may leave them "undefined". The form (40) provides a general way to obtain rate functions which are achievable up to a small factor. Specifically, since rate functions of the form (40) have by definition $L_{F=t,n} = 1$, they have $\mu_Q(R_{emp}) \le 0$ by (28), and are therefore achievable up to $\delta_n = \frac{1}{n}\log\frac{1}{\epsilon}$ (Theorem 2). This observation is formalized below.

Lemma 3. For any conditional distribution $f(\mathbf{x}|\mathbf{y})$, the rate function defined in (40) has $\mu_Q(R_{emp}) \le 0$ and is achievable (with a prior $Q$ and error probability $\epsilon$) up to $\delta_n = \frac{1}{n}\log\frac{1}{\epsilon}$.

On the other hand, it is also possible to give a lower bound on the redundancy of this rate function (the reverse of Lemma [3]) 
by using the proof technique from Theorem |4] The following Lemma is proven in Appendix [P] 



Lemma 4. If the rate function defined in ( |40| i satisfies Rcmp < ^max G ^ , then this function is achievable (with a prior Q 
and error probability e) up to 6, only if S > i ^~' — 

The fact the bound is negative is not surprising, since this rate function has a non-positive intrinsic redundancy. Using both 
Lemmas we can bound the redundancy S up to an order of 0{-^^^^). . 

The main result of this section states that all rate functions are asymptotically bounded by the form of (40) (for some $f$). I.e. this is a general way to construct all asymptotically achievable rate functions.

Theorem 5. A sequence of rate functions $R_{\mathrm{emp}}(\mathbf{x}^n, \mathbf{y}^n)$ is asymptotically achievable (with a sequence of priors $Q(\mathbf{x}^n)$), iff there exist a sequence of functions $F_n(t) \to t$ (as $n \to \infty$) and a sequence of conditional distributions $f(\mathbf{x}^n|\mathbf{y}^n)$ such that

$$F_n\left[R_{\mathrm{emp}}(\mathbf{x}^n,\mathbf{y}^n)\right] \le \frac{1}{n}\log\left(\frac{f(\mathbf{x}^n|\mathbf{y}^n)}{Q(\mathbf{x}^n)}\right) \tag{41}$$

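Proof: Direct part: if (41) holds, then $F_n[R_{\mathrm{emp}}(\mathbf{x}^n, \mathbf{y}^n)]$ is upper bounded by the rate function (40), which is asymptotically achievable by Lemma 3, and therefore by definition $R_{\mathrm{emp}}(\mathbf{x}^n,\mathbf{y}^n)$ is asymptotically achievable.

Reverse part: suppose $R_{\mathrm{emp}}$ is asymptotically achievable, then by Theorem 4, for some $F_n$ and a bounding sequence $\delta_n$:

$$\frac{1}{n}\log L_{F,n} \le \delta_n \to 0 \tag{42}$$

Define

$$f(\mathbf{x}^n|\mathbf{y}^n) = \frac{Q(\mathbf{x}^n)\cdot\exp\left(n F_n\left[R_{\mathrm{emp}}(\mathbf{x}^n,\mathbf{y}^n)\right]\right)}{L_{F,n}} \tag{43}$$

By definition of $L_{F,n}$ (27), the denominator is the sum over $\mathbf{x}$ of the numerator, therefore $f(\mathbf{x}^n|\mathbf{y}^n)$ is a conditional distribution. Extracting $L_{F,n}$ from (43) and substituting in (42) we have:

$$\frac{1}{n}\log L_{F,n} = \frac{1}{n}\log\left(\frac{Q(\mathbf{x}^n)\cdot\exp\left(n F_n\left[R_{\mathrm{emp}}(\mathbf{x}^n,\mathbf{y}^n)\right]\right)}{f(\mathbf{x}^n|\mathbf{y}^n)}\right) = F_n\left[R_{\mathrm{emp}}(\mathbf{x}^n,\mathbf{y}^n)\right] - \frac{1}{n}\log\left(\frac{f(\mathbf{x}^n|\mathbf{y}^n)}{Q(\mathbf{x}^n)}\right) \le \delta_n \tag{44}$$

Defining $G_n(t) = F_n(t) - \delta_n$ we have that

$$G_n\left[R_{\mathrm{emp}}(\mathbf{x}^n,\mathbf{y}^n)\right] \le \frac{1}{n}\log\left(\frac{f(\mathbf{x}^n|\mathbf{y}^n)}{Q(\mathbf{x}^n)}\right) \tag{45}$$

Therefore $G_n$ satisfies the conditions of the theorem. $\square$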

E. Manipulating rate functions 

Following the results of this and the previous section we can consider various manipulations of rate functions. 



In Section IV-A we have seen that when taking the maximum over $K$ rate functions, the increase in the intrinsic redundancy is at most $\frac{\log K}{n}$.

Theorem 1 states the achievability conditions separately per $\mathbf{y}$. Therefore if we have two rate functions that satisfy the sufficient condition, and we mix them by arbitrarily choosing for each $\mathbf{y}$ one of the rate functions, the resulting rate function is achievable.

Suppose that we have $K$ sequences of rate functions of the form

$$R_{\mathrm{emp}}^{(k)}(\mathbf{x},\mathbf{y}) = \frac{1}{n}\log\frac{f_k(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})}, \qquad k = 1,\ldots,K \tag{46}$$

By definition each of these rate functions has a non-positive intrinsic redundancy. Then the following rate function:

$$R_{\mathrm{emp}}(\mathbf{x}^n,\mathbf{y}^n) = \frac{1}{n}\log\frac{\sum_{k=1}^{K} f_k(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})} = \frac{1}{n}\log\frac{\frac{1}{K}\sum_{k=1}^{K} f_k(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})} + \frac{\log K}{n} \tag{47}$$

satisfies $R_{\mathrm{emp}} \ge R_{\mathrm{emp}}^{(k)}$ (as visible from the first expression in (47)), and has intrinsic redundancy at most $\frac{\log K}{n}$ (as visible from the second expression in (47)).
These results have analogs in universal source coding. In source coding, given $K$ encoders with encoding lengths $l_k(\mathbf{x}) = -\log(p_k(\mathbf{x}))$ (for the source sequence $\mathbf{x}$), by defining the universal distribution $p(\mathbf{x}) = \frac{1}{K}\sum_k p_k(\mathbf{x})$, one obtains the encoding lengths $l(\mathbf{x}) = -\log(p(\mathbf{x}))$, which satisfy $l(\mathbf{x}) \le l_k(\mathbf{x}) + \log(K)$, i.e. there is a regret of at most $\log(K)$ compared to the $K$ encoders. This fact, which stems from the logarithmic relation between probabilities and encoding lengths, is the basis for universal encoding (since the normalized penalty $\frac{\log K}{n}$ vanishes as $n \to \infty$). Similarly in our case, the logarithmic loss in the number of competitors will be the basis for universally competing with multiple models.
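
The source coding analogy can be checked numerically. The following is a minimal sketch, not part of the original derivation: the three probabilities in `p_k` are hypothetical values that $K = 3$ models assign to one fixed sequence, and the helper names are chosen here for illustration. It shows that the uniform mixture pays at most $\log K$ nats relative to each individual model.

```python
import math

def code_length(p):
    """Ideal code length (in nats) for a sequence of probability p."""
    return -math.log(p)

def mixture_probability(probs):
    """Uniform mixture of K probability assignments to the same sequence."""
    return sum(probs) / len(probs)

# Hypothetical probabilities assigned to one fixed sequence x by K = 3 models.
p_k = [1e-6, 3e-4, 2e-9]
K = len(p_k)

l_mix = code_length(mixture_probability(p_k))
# Regret of the mixture code with respect to each individual model.
regrets = [l_mix - code_length(p) for p in p_k]
max_regret = max(regrets)
```

The largest regret is attained against the best model in hindsight and stays below $\log K$; dividing by $n$ gives the vanishing penalty $\frac{\log K}{n}$ mentioned above.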

F. Discussion 

The definition of asymptotical achievability: As we have noted, the definition of asymptotical achievability is rather loose, by allowing any $F_n(t) \to t$ that translates the rate function to a strictly achievable one. This is done mainly for the sake of the adaptive case, in which, as we shall see, $F_n$ takes various forms, usually non-linear. However for the non-adaptive case, the definition could have been narrowed by considering only $F_n(t)$ of the linear form $F_n(t) = \gamma_n \cdot t - \delta_n$ with $\gamma_n \to 1$, $\delta_n \to 0$ as $n \to \infty$. All results in this section would be true also under this restricted form of $F_n(t)$.

VI. Constructions for rate functions 

In the last two sections we have defined the conditions for achievability of rate functions, but haven't dealt with the selection 
of the rate function out of all achievable functions. In this section, we deal with the problem of selecting the rate function. We 
define constructions for rate functions which have meaningful structure. This is similar to choosing, from all encoders which 
comply with the Kraft inequality, those that compete well with all encoders based on a family of models. We propose two main
constructions: 

1) ML construction: Rate functions that guarantee achieving the mutual information rate over a family of potential channel 
distributions. 

2) Rate functions that are defined via a certain parameterization or classification of sequences. 

These constructions supply reasoning for choosing a specific rate function, give a uniform way to construct several rate 
functions that seem to be of interest, and will allow us later to prove general claims referring to the construction (rather than 
specific to a certain rate function). 

A. Empirical distributions and information measures 

We begin with some definitions that will be useful in the sequel. The definitions below are applicable to probability 
distributions or probability density functions, unless stated otherwise. 

1) Empirical distribution: Given sequences (or equivalently vectors or ordered tuples) $\mathbf{a} = (a_i)_{i=1}^n$, $\mathbf{b} = (b_i)_{i=1}^n$ where $a_i \in A$, $b_i \in B$ and $A, B$ are discrete alphabet sets, we define the empirical distribution:

$$\hat{P}_{\mathbf{a}}(a) \equiv \hat{P}_{(a_i)}(a) = \frac{|\{i : a_i = a\}|}{n}, \qquad a \in A \tag{48}$$

and the conditional empirical distribution

$$\hat{P}_{\mathbf{a}|\mathbf{b}}(a|b) \equiv \hat{P}_{(a_i|b_i)}(a|b) = \frac{\hat{P}_{(\mathbf{a},\mathbf{b})}(a,b)}{\hat{P}_{\mathbf{b}}(b)}, \qquad a \in A,\ b \in B \tag{49}$$

For example $\hat{P}_{(x_i|x_{i-1},x_{i-2})}(x_0|x_{-1},x_{-2})$ yields the empirical distribution of each value in the sequence $x_i$ given the two previous values. The empirical distribution of a sequence $\mathbf{x}$, denoted $\hat{P}_{\mathbf{x}}(x)$, is just the zero order empirical distribution.
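
As an illustration of these two definitions, here is a minimal sketch that computes the empirical and conditional empirical distributions of short discrete sequences. The function names and the example sequences are assumptions made here for illustration only.

```python
from collections import Counter

def empirical_distribution(a):
    """Zero-order empirical distribution: relative frequency of each symbol."""
    n = len(a)
    return {sym: cnt / n for sym, cnt in Counter(a).items()}

def conditional_empirical_distribution(a, b):
    """Empirical distribution of a_i given b_i, keyed by (a, b).

    Computed as the joint count of (a, b) divided by the count of b,
    which equals the ratio of joint to marginal empirical distributions.
    """
    joint = Counter(zip(a, b))
    marg_b = Counter(b)
    return {(x, y): c / marg_b[y] for (x, y), c in joint.items()}

# Hypothetical example sequences of length n = 6.
a = [0, 1, 1, 0, 1, 1]
b = [0, 0, 1, 1, 1, 1]

P_a = empirical_distribution(a)
P_a_given_b = conditional_empirical_distribution(a, b)
```

For every value of `b` that appears in the sequence, the conditional entries sum to one over `a`, as expected from a conditional distribution.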
2) Empirical probability: Given a probability law $Q(\mathbf{x})$, the probability of the sequence $\mathbf{x}$ is $Q(\mathbf{x})$. The empirical probability of the discrete sequence $\mathbf{x}$ is the probability of the sequence under the i.i.d. empirical distribution of itself, and is denoted $\hat{p}(\mathbf{x})$. I.e.:

$$\hat{p}(\mathbf{x}) = \hat{P}_{\mathbf{x}}^n(\mathbf{x}) = \prod_{i=1}^{n}\hat{P}_{\mathbf{x}}(x_i) = \prod_{x\in\mathcal{X}}\ \prod_{i: x_i = x}\hat{P}_{\mathbf{x}}(x) = \prod_{x\in\mathcal{X}}\hat{P}_{\mathbf{x}}(x)^{n\hat{P}_{\mathbf{x}}(x)} \tag{50}$$

Note that the empirical probability is, in general, not a legitimate probability distribution (but a super-distribution, i.e. it has $\sum_{\mathbf{x}}\hat{p}(\mathbf{x}) \ge 1$), as we shall see below.
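
The super-distribution property can be observed directly on a small example. The sketch below (an illustration under assumed parameters: binary alphabet, block length $n = 4$) evaluates the empirical probability of every sequence using the last form of (50) and sums the results.

```python
import itertools
import math
from collections import Counter

def empirical_probability(x):
    """Probability of x under its own i.i.d. empirical distribution.

    Uses the product over symbols of P_hat(symbol) ** count(symbol).
    """
    n = len(x)
    counts = Counter(x)
    return math.prod((c / n) ** c for c in counts.values())

n = 4
# Sum over all binary sequences of length n (assumed alphabet {0, 1}).
total = sum(empirical_probability(x)
            for x in itertools.product([0, 1], repeat=n))

# The empirical probability also dominates any fixed i.i.d. law, e.g. Bernoulli(0.3).
q = 0.3
x_example = (0, 0, 1, 1)
q_prob = (q ** sum(x_example)) * ((1 - q) ** (n - sum(x_example)))
```

The total exceeds one, so $\hat{p}$ is not a probability distribution over sequences, although each value is a valid probability of a single sequence.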

Similarly, we define the conditional empirical probability as the probability of the sequence under the conditional empirical distribution of itself (induced by another sequence). To keep the definitions general we denote the conditioning sequence by $\mathbf{z} \in \mathcal{Z}^n$ (here and in the sequel). This conditioning sequence may include $\mathbf{y}$ or possibly delayed or modified versions of $\mathbf{x}$ and $\mathbf{y}$. For the purpose of this section it does not matter whether $\mathbf{z}$ is derived from $\mathbf{x}$, since all sequences are fixed. The conditional empirical probability means that for each set of symbols in $\mathbf{x}$ for which a certain symbol in $\mathbf{z}$ appears, i.e. $z_i = z$, we separately measure the empirical probability.

$$\hat{p}(\mathbf{x}|\mathbf{z}) = \prod_{i=1}^{n}\hat{P}_{\mathbf{x}|\mathbf{z}}(x_i|z_i) = \prod_{x\in\mathcal{X},\,z\in\mathcal{Z}}\ \prod_{i: x_i = x,\, z_i = z}\hat{P}_{\mathbf{x}|\mathbf{z}}(x|z) = \prod_{x\in\mathcal{X},\,z\in\mathcal{Z}}\hat{P}_{\mathbf{x}|\mathbf{z}}(x|z)^{n\hat{P}_{\mathbf{x},\mathbf{z}}(x,z)} \tag{51}$$

3) Maximum likelihood probability: In structuring universal schemes, we often base a universal model on a wide class of probabilistic models [7] (attempting to beat each model in the class). The definition of maximum likelihood probability generalizes the definition of empirical probability above, and provides a useful tool for constructing rate functions.

Denote by $p_\theta(\mathbf{x})$ a class of distributions over the sequence $\mathbf{x}$, with the index $\theta \in \Theta$ (the class $\Theta$ is not necessarily finite or countable). The maximum likelihood estimate of $\theta$ from $\mathbf{x}$ is

$$\hat{\theta}_{\mathrm{ML}}(\mathbf{x}) = \arg\max_{\theta} p_\theta(\mathbf{x}) \tag{52}$$

The maximum likelihood distribution defined by $\mathbf{x}$ is the distribution defined by the parameter $\theta = \hat{\theta}_{\mathrm{ML}}(\mathbf{x})$. The maximum likelihood probability of the sequence $\mathbf{x}$ is the maximum probability given to $\mathbf{x}$ by any member of the class $\{p_\theta\}$, or can alternatively be written as the probability of $\mathbf{x}$ under the maximum likelihood distribution:

$$\hat{p}_{\mathrm{ML}}(\mathbf{x}) = \max_{\theta} p_\theta(\mathbf{x}) = p_{\hat{\theta}_{\mathrm{ML}}(\mathbf{x})}(\mathbf{x}) \tag{53}$$

By definition $\hat{p}_{\mathrm{ML}}(\mathbf{x})$ satisfies $\hat{p}_{\mathrm{ML}}(\mathbf{x}) \ge p_\theta(\mathbf{x})$ for every $\theta$. Except in degenerate cases, $\hat{p}_{\mathrm{ML}}(\mathbf{x})$ is not a probability distribution, but a (strict) super-probability. Specifically, if we have two different distributions $p_1(\mathbf{x}), p_2(\mathbf{x})$, then at at least one point $p_2(\mathbf{x}) > p_1(\mathbf{x})$ (and, since both sum to one, at some other point $p_1 > p_2$), therefore the sum $\sum_{\mathbf{x}\in\mathcal{X}^n}\hat{p}_{\mathrm{ML}}(\mathbf{x}) = \sum_{\mathbf{x}\in\mathcal{X}^n}\max(p_1(\mathbf{x}),p_2(\mathbf{x})) > \sum_{\mathbf{x}\in\mathcal{X}^n} p_1(\mathbf{x}) = 1$, since the summand is at least $p_1$ and larger than $p_1$ at at least one point.

The definition extends trivially to the conditional case. Using a class of conditional distributions $p_\theta(\mathbf{x}|\mathbf{z})$ with respect to the generic sequence $\mathbf{z} \in \mathcal{Z}^n$, every fixed value of $\mathbf{z}$ induces a set of probabilities on $\mathbf{x}$. We define

$$\hat{p}_{\mathrm{ML}}(\mathbf{x}|\mathbf{z}) = \max_{\theta} p_\theta(\mathbf{x}|\mathbf{z}) \tag{54}$$

Note that the class of conditional distributions $p_\theta(\mathbf{x}|\mathbf{z})$ may be derived from a class of joint distributions $p_\theta(\mathbf{x},\mathbf{z})$, but this is not necessary.

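For discrete sequences, taking $\Theta$ to be the class of i.i.d. distributions (defined by the probability $\theta(x), x \in \mathcal{X}$ for each value of $x$)

$$p_\theta(\mathbf{x}) = \prod_{i=1}^{n}\theta(x_i) \equiv \theta^n(\mathbf{x}) \tag{55}$$

we have that the maximum likelihood distribution is the empirical distribution of $\mathbf{x}$, i.e.

$$\hat{p}_{\mathrm{ML}}(\mathbf{x}) = \max_{\theta}\theta^n(\mathbf{x}) = \hat{p}(\mathbf{x}) \tag{56}$$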
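This is shown below:

$$\log p_\theta(\mathbf{x}) = \sum_{i=1}^{n}\log\theta(x_i) = n\sum_{x\in\mathcal{X}}\hat{P}_{\mathbf{x}}(x)\log\theta(x) = n\sum_{x\in\mathcal{X}}\hat{P}_{\mathbf{x}}(x)\log\hat{P}_{\mathbf{x}}(x) - nD(\hat{P}_{\mathbf{x}}\|\theta) \le \log p_{\theta=\hat{P}_{\mathbf{x}}}(\mathbf{x}) \tag{57}$$

Therefore $\hat{\theta}_{\mathrm{ML}}(\mathbf{x}) = \arg\max_\theta p_\theta(\mathbf{x}) = \hat{P}_{\mathbf{x}}$. As a result, the empirical probability of $\mathbf{x}$, $\hat{p}(\mathbf{x})$, equals the maximum likelihood probability of $\mathbf{x}$ under the i.i.d. model class. Therefore the maximum likelihood probability is a generalization of empirical probability, which is not limited to discrete sequences, and can be applied to continuous sequences, and include time structure.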

Another consequence of the fact that $\hat{p}_{\mathrm{ML}}(\mathbf{x}) = \hat{p}(\mathbf{x})$ for the class of memoryless models is that for any i.i.d. distribution $Q^n(\mathbf{x})$, and every sequence: $\hat{p}(\mathbf{x}) = \max_\theta p_\theta(\mathbf{x}) \ge Q^n(\mathbf{x})$ (since $Q \in \Theta$).

The same result holds for the conditional case, i.e. defining the class $\Theta$ as the class of conditionally memoryless models $p_\theta(\mathbf{x}|\mathbf{z}) = \prod_{i=1}^{n}\theta(x_i|z_i)$, we have that $\hat{p}_{\mathrm{ML}}(\mathbf{x}|\mathbf{z}) = \hat{p}(\mathbf{x}|\mathbf{z})$. To see that, note that the distribution $p_\theta(\mathbf{x}|\mathbf{z})$ can be written as a product of the distributions of sub-vectors of $\mathbf{x}$ which have constant $z_i$ (i.e. all indices for which $z_i = z$). Each of these sub-vectors has an independent set of parameters $\theta(\cdot|z)$, and maximizing the probability over $\theta$ implies maximizing the probability of each sub-vector separately. As we have seen above, this maximization yields the empirical probability of $\mathbf{x}$ over the sub-vector. Therefore the maximum is obtained for $\theta(x|z) = \hat{P}_{\mathbf{x}|\mathbf{z}}(x|z)$.

4) Maximum likelihood, empirical and quasi-empirical entropies: Given a probability distribution $p(x)$, the self information of the element $x$ is defined as

$$\log\frac{1}{p(x)} \tag{58}$$

and the entropy is the expected value of the self information:

$$H(X) = E\left[\log\frac{1}{p(X)}\right] = -\sum_{x} p(x)\log p(x) \tag{59}$$

We define the quasi-empirical entropy of a sequence $\mathbf{x}$ with respect to a model $p(x)$ as the above expression, where the expected value is replaced by the empirical expectation:

$$\hat{H}_p(\mathbf{x}) \equiv \hat{E}\left[\log\frac{1}{p(X)}\right] = -\sum_{x}\hat{P}_{\mathbf{x}}(x)\log p(x) = -\frac{1}{n}\sum_{i=1}^{n}\log p(x_i) = -\frac{1}{n}\log p^n(\mathbf{x}) \tag{60}$$

The last expression implies that the quasi-empirical entropy is the normalized self information of the sequence $\mathbf{x}$, with the i.i.d. probability $p$.

For discrete sequences, the empirical entropy of a sequence $\mathbf{x}$ is defined as the entropy of the random variable with the distribution $X \sim \hat{P}_{\mathbf{x}}(x)$ [8, Section II]. The empirical entropy of a sequence $\mathbf{x}$ is obtained from (59) by replacing the distribution $p(x)$ with the empirical distribution $\hat{P}_{\mathbf{x}}(x)$:

$$\hat{H}(\mathbf{x}) = -\sum_{x}\hat{P}_{\mathbf{x}}(x)\log\hat{P}_{\mathbf{x}}(x) \tag{61}$$

Equivalently, using (50) we may relate $\hat{H}(\mathbf{x})$ to the empirical probability:

$$\hat{H}(\mathbf{x}) = -\frac{1}{n}\log\hat{p}(\mathbf{x}) \tag{62}$$

This supplies an intuitively appealing way to understand $\hat{H}$ as the normalized self information of the sequence, under its estimated i.i.d. probability $\hat{P}_{\mathbf{x}}$. Equivalently we may write the empirical entropy as the quasi-empirical entropy using the empirical distribution, $\hat{H}(\mathbf{x}) = \hat{H}_{\hat{P}_{\mathbf{x}}}(\mathbf{x})$. From the relation between the empirical probability and the maximum likelihood probability, $\hat{p}(\mathbf{x}) = \max_p p^n(\mathbf{x})$, we have that

$$\hat{H}(\mathbf{x}) = -\frac{1}{n}\log\hat{p}(\mathbf{x}) = -\max_{p}\frac{1}{n}\log p^n(\mathbf{x}) = \min_{p}\hat{H}_p(\mathbf{x}) \tag{63}$$

I.e. by using the i.i.d. model extracted from $\mathbf{x}$ (rather than an arbitrary $p$) we minimize its quasi-empirical entropy.
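
The relation (63) can be illustrated numerically. The following minimal sketch (the example sequence and the alternative model `other` are hypothetical, and natural logarithms are used) compares the empirical entropy with the quasi-empirical entropy under the empirical distribution and under another model.

```python
import math
from collections import Counter

def empirical_entropy(x):
    """Entropy (nats) of the zero-order empirical distribution of x."""
    n = len(x)
    return -sum((c / n) * math.log(c / n) for c in Counter(x).values())

def quasi_empirical_entropy(x, p):
    """Normalized self information of x under the i.i.d. model p (a dict)."""
    return -sum(math.log(p[s]) for s in x) / len(x)

x = "aabacabb"  # hypothetical sequence with counts a:4, b:3, c:1
P_hat = {s: c / len(x) for s, c in Counter(x).items()}

H_hat = empirical_entropy(x)
H_at_P_hat = quasi_empirical_entropy(x, P_hat)

# An arbitrary alternative model; it can only increase the quasi-empirical entropy.
other = {"a": 0.5, "b": 0.25, "c": 0.25}
H_other = quasi_empirical_entropy(x, other)
```

The gap `H_other - H_hat` equals the divergence $D(\hat{P}_{\mathbf{x}}\|p)$ appearing in (57), which is why the minimum is attained at the empirical distribution.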
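As an extension, given a class of models $p_\theta(\mathbf{x}), \theta \in \Theta$, we may define the maximum likelihood entropy of a sequence as the normalized self information of the sequence under the maximum-likelihood distribution.

$$\hat{H}_{\mathrm{ML}}(\mathbf{x}) = -\frac{1}{n}\log\hat{p}_{\mathrm{ML}}(\mathbf{x}) \tag{64}$$

As before, all relations extend trivially to the conditional case (conditioned on the generic sequence $\mathbf{z}$), by simply considering each sub-vector of $\mathbf{x}$ related to a specific value in $\mathbf{z}$. I.e.

$$\hat{H}_p(\mathbf{x}|\mathbf{z}) = -\frac{1}{n}\sum_{i=1}^{n}\log p(x_i|z_i) = -\frac{1}{n}\log p^n(\mathbf{x}|\mathbf{z}) \tag{65}$$

$$\hat{H}(\mathbf{x}|\mathbf{z}) = -\sum_{x,z}\hat{P}_{\mathbf{x}\mathbf{z}}(x,z)\log\hat{P}_{\mathbf{x}|\mathbf{z}}(x|z) = -\frac{1}{n}\log\hat{p}(\mathbf{x}|\mathbf{z}) = \min_{p}\hat{H}_p(\mathbf{x}|\mathbf{z}) \tag{66}$$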

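While the standard chain rule holds for empirical entropies (being entropies of dummy random variables), it does not, in general, hold for entropies defined by maximum likelihood probabilities. Since, in general, we have:

$$\hat{p}_{\mathrm{ML}}(\mathbf{x},\mathbf{z}) = \max_{\theta\in\Theta} p_\theta(\mathbf{x},\mathbf{z}) = \max_{\theta\in\Theta}\left[p_\theta(\mathbf{z})\,p_\theta(\mathbf{x}|\mathbf{z})\right] \le \max_{\theta\in\Theta} p_\theta(\mathbf{z})\cdot\max_{\theta\in\Theta} p_\theta(\mathbf{x}|\mathbf{z}) = \hat{p}_{\mathrm{ML}}(\mathbf{z})\cdot\hat{p}_{\mathrm{ML}}(\mathbf{x}|\mathbf{z}) \tag{67}$$

Then

$$\hat{H}_{\mathrm{ML}}(\mathbf{x},\mathbf{z}) = -\frac{1}{n}\log\hat{p}_{\mathrm{ML}}(\mathbf{x},\mathbf{z}) \ge -\frac{1}{n}\log\left[\hat{p}_{\mathrm{ML}}(\mathbf{z})\,\hat{p}_{\mathrm{ML}}(\mathbf{x}|\mathbf{z})\right] = \hat{H}_{\mathrm{ML}}(\mathbf{z}) + \hat{H}_{\mathrm{ML}}(\mathbf{x}|\mathbf{z}) \tag{68}$$

However, equality holds in (67), (68) when the parameters $\theta$ can be separated into a set of parameters $\theta_z$ controlling $p_\theta(\mathbf{z})$ and a set $\theta_{x|z}$ controlling $p_\theta(\mathbf{x}|\mathbf{z})$. This occurs for example in the discrete memoryless case (where $\hat{H}_{\mathrm{ML}}$ is the empirical entropy), since the single letter distribution $\theta(x,z)$ can be separated into $\theta(z)$ and $\theta(x|z)$, and therefore we have equality in this case.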

5) Empirical mutual information: Similarly to the empirical entropy, the empirical mutual information of two vectors $\hat{I}(\mathbf{x};\mathbf{y})$ is defined as the mutual information between two random variables $X, Y$ with the joint distribution $(X,Y) \sim \hat{P}_{\mathbf{x},\mathbf{y}}(x,y)$, i.e. whose joint distribution equals the empirical distribution of $\mathbf{x},\mathbf{y}$ [8, Section II]. This way of defining the empirical mutual information and empirical entropy as mutual information/entropy of alternative random variables can be extended to conditional forms. In general, all expressions such as $\hat{H}(\mathbf{x})$, $\hat{H}(\mathbf{x}|\mathbf{y})$, $\hat{I}(\mathbf{x};\mathbf{y})$, $\hat{I}(\mathbf{x};\mathbf{y}|\mathbf{z})$, $\hat{I}(\mathbf{x};\mathbf{y}|\mathbf{z} = z_0)$ are interpreted as their respective probabilistic counterparts $H(X)$, $H(X|Y)$, $I(X;Y)$, $I(X;Y|Z)$, $I(X;Y|Z = z_0)$ where $(X,Y,Z)$ are random variables distributed according to the empirical distribution of the vectors $\hat{P}_{(\mathbf{x},\mathbf{y},\mathbf{z})}$. Equivalently $(X,Y,Z)$ can be defined as a random selection of an element of the vectors, i.e. $(X,Y,Z) = (x_i,y_i,z_i)$, $i \sim U\{1,\ldots,n\}$. It is clear from this equivalence that known properties of these values, such as relations between mutual information and entropy, non-negativity, chain rules, etc., are directly translated to relations on their empirical counterparts.

In particular, we can write the empirical mutual information as:

$$\hat{I}(\mathbf{x};\mathbf{y}) = \hat{H}(\mathbf{x}) - \hat{H}(\mathbf{x}|\mathbf{y}) = \hat{H}(\mathbf{x}) + \hat{H}(\mathbf{y}) - \hat{H}(\mathbf{x},\mathbf{y}) \tag{69}$$

Writing the entropies as the self information under the empirical distribution we have:

$$\hat{I}(\mathbf{x};\mathbf{y}) = \hat{H}(\mathbf{x}) + \hat{H}(\mathbf{y}) - \hat{H}(\mathbf{x},\mathbf{y}) = -\frac{1}{n}\log\hat{p}(\mathbf{x}) - \frac{1}{n}\log\hat{p}(\mathbf{y}) + \frac{1}{n}\log\hat{p}(\mathbf{x},\mathbf{y}) = \frac{1}{n}\log\frac{\hat{p}(\mathbf{x},\mathbf{y})}{\hat{p}(\mathbf{x})\hat{p}(\mathbf{y})} = \frac{1}{n}\log\frac{\hat{p}(\mathbf{x}|\mathbf{y})}{\hat{p}(\mathbf{x})} \tag{70}$$

Note the similarity to the form (40).
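
The two ways of writing the empirical mutual information, through entropies and through the ratio of empirical probabilities, can be compared on a small example. This is a minimal sketch with hypothetical sequences; the function names are introduced here for illustration, and natural logarithms are used.

```python
import math
from collections import Counter

def H(*seqs):
    """Empirical (joint) entropy in nats of one or more equal-length sequences."""
    n = len(seqs[0])
    return -sum((c / n) * math.log(c / n)
                for c in Counter(zip(*seqs)).values())

def log_emp_prob(x, z=None):
    """Log of the (conditional) empirical probability of x (given z)."""
    if z is None:
        z = [0] * len(x)  # a constant conditioning sequence gives the unconditional case
    joint, marg = Counter(zip(x, z)), Counter(z)
    # Sum over (x, z) pairs of count(x, z) * log( count(x, z) / count(z) ).
    return sum(c * math.log(c / marg[zv]) for (_, zv), c in joint.items())

x = [0, 0, 1, 1, 0, 1, 1, 0]
y = [0, 0, 1, 1, 1, 1, 0, 0]
n = len(x)

I_entropy = H(x) + H(y) - H(x, y)                         # entropy form
I_ratio = (log_emp_prob(x, y) - log_emp_prob(x)) / n      # (1/n) log p(x|y)/p(x)
```

Both expressions agree up to floating point error, which is the identity that links the empirical mutual information to the general form (40).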

B. Maximum likelihood based rate functions 



1) Rationale: In Section V-D we observed that attainable rate functions are asymptotically limited by the form

$$R_{\mathrm{emp}}(\mathbf{x},\mathbf{y}) = \frac{1}{n}\log\frac{P(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})} \tag{71}$$

Let us assume that there is a probabilistic model relating $\mathbf{y}$ to $\mathbf{x}$, and $P(\mathbf{x}|\mathbf{y})$ is the true conditional probability resulting from this model. In this case the value $i(\mathbf{x},\mathbf{y}) = \log\frac{P(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})}$ is termed the information spectrum or information density [12, (1.5)], and we have that the mutual information between the input and output vectors is

$$I(\mathbf{X};\mathbf{Y}) = E\, i(\mathbf{X},\mathbf{Y}) \tag{72}$$
As noted by Han and Verdú [12], for general models (not necessarily i.i.d. or ergodic), the mutual information $I(\mathbf{X};\mathbf{Y})$ is not necessarily an achievable rate, and their characterization of channel capacity in this case relies on the "liminf in probability" of $\frac{1}{n}\cdot i(\mathbf{X},\mathbf{Y})$, which means the maximum value $\alpha$ such that the probability that $\frac{1}{n}\cdot i(\mathbf{X},\mathbf{Y}) < \alpha$ tends to 0 as $n \to \infty$. In other words, achieving a rate $R$ requires that with high probability $i(\mathbf{X},\mathbf{Y}) \ge nR$.

Setting the rate function as the normalized information density of a specific probabilistic model, i.e. $R_{\mathrm{emp}}(\mathbf{x},\mathbf{y}) = \frac{1}{n}i(\mathbf{x},\mathbf{y})$, is advantageous, especially when this rate function is attained adaptively, since this means that on average, the communication rate would be $E R_{\mathrm{emp}} = \frac{1}{n}E\,i(\mathbf{X},\mathbf{Y}) = \frac{1}{n}I(\mathbf{X};\mathbf{Y})$. For general models, and with a suitable prior $Q(\mathbf{x})$, this value may be larger than the Han-Verdú capacity (which is a lower bound in probability on $i$ rather than its mean). This occurs due to the use of feedback for rate adaptation. As an example, suppose a non-ergodic binary channel may be in one of two states, which are determined by a single random drawing with equal probabilities - either the output equals the input for $j = 1,\ldots,n$, or it is independent of the input. Clearly, no positive rate can be guaranteed on this channel, but if one allows the rate to vary, we may achieve a rate of 1 [bit/use] $\frac{1}{2}$ the time, and thus a rate of $\frac{1}{2}$ [bit/use] on average.
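
The two-state example can be illustrated with a small numerical sketch. As an assumption made only for this illustration, the empirical mutual information (in bits) of a simulated block stands in for the adaptively attained rate; the block length, seed and function names are likewise hypothetical choices, not taken from the text.

```python
import math
import random
from collections import Counter

def empirical_mi_bits(x, y):
    """Empirical mutual information of two equal-length sequences, in bits."""
    n = len(x)
    def H(seq):
        return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())
    return H(x) + H(y) - H(list(zip(x, y)))

rng = random.Random(0)
n = 2000
x = [rng.randint(0, 1) for _ in range(n)]

# "Good" state: the output equals the input over the whole block.
rate_good = empirical_mi_bits(x, list(x))

# "Bad" state: the output is drawn independently of the input.
y_bad = [rng.randint(0, 1) for _ in range(n)]
rate_bad = empirical_mi_bits(x, y_bad)

# The state is drawn once with equal probabilities, so the mean rate is the average.
average_rate = 0.5 * rate_good + 0.5 * rate_bad
```

In the good state the measured rate is close to 1 bit per use, in the bad state it is close to 0, and the average is close to $\frac{1}{2}$, while the rate that can be guaranteed a-priori is zero.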

If we attain the normalized information density $R_{\mathrm{emp}}(\mathbf{x},\mathbf{y}) = \frac{1}{n}\cdot i(\mathbf{x},\mathbf{y})$ adaptively, then not only do we attain the mutual information on average, but we also attain a rate of at least the liminf in probability of $\frac{1}{n}\cdot i(\mathbf{X},\mathbf{Y})$ with high probability (the latter value becomes the channel capacity if the input distribution $Q(\mathbf{x})$ is optimized). Another rationale for choosing $\frac{1}{n}\log\frac{P(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})}$ as the rate function is that we know from Theorem 5 that asymptotically the rate function is bounded by $R_{\mathrm{emp}} = \frac{1}{n}\log\frac{f(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})}$ for some conditional distribution $f(\mathbf{x}|\mathbf{y})$. If one assumes that the channel model truly induces the conditional probability $P(\mathbf{x}|\mathbf{y})$, then the average rate would be $E R_{\mathrm{emp}} = \frac{1}{n}\sum_{\mathbf{x},\mathbf{y}} P(\mathbf{x}|\mathbf{y})P(\mathbf{y})\log\frac{f(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})}$, which is maximized when $f(\mathbf{x}|\mathbf{y}) = P(\mathbf{x}|\mathbf{y})$. I.e. when the channel induces $P$, any choice other than $P$ in the numerator will degrade the achieved rate, while choosing $P$ attains the mutual information. So far, we have justified why it makes sense to choose the rate function $\frac{1}{n}\log\frac{P(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})}$ if the channel is assumed to be known.

However, the main motivation for the individual channel framework is to avoid the probabilistic model. One possible approach is to guarantee a rate close to the information density, for a class of models. Let $P_\theta(\mathbf{x},\mathbf{y}), \theta \in \Theta$ be a class of models for the joint probability of the vectors $\mathbf{x},\mathbf{y}$. We denote by $P_\theta(\mathbf{x})$, $P_\theta(\mathbf{x}|\mathbf{y})$ the marginal and the conditional distribution resulting from $P_\theta(\mathbf{x},\mathbf{y})$. Then a possible rate function is the maximum normalized information density over all models in the family.

$$R_{\mathrm{emp}}^{\mathrm{ML}} = \max_{\theta\in\Theta}\frac{1}{n}\log\frac{P_\theta(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})} = \frac{1}{n}\log\frac{\max_{\theta\in\Theta}P_\theta(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})} = \frac{1}{n}\log\frac{\hat{p}_{\mathrm{ML}}(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})} \tag{73}$$

Clearly, attaining this rate function guarantees attaining the above properties (the mutual information rate on average and the liminf in high probability) for all channels in the family. The family of distributions may be constrained to have $\forall\theta: P_\theta(\mathbf{x}) = Q(\mathbf{x})$ but this is not necessary, and it is sometimes more convenient to avoid this constraint. However we assume that there exists $\theta$ such that $P_\theta(\mathbf{x}) = Q(\mathbf{x})$ and therefore (73) includes maximization over information densities (and possibly other values which are not legitimate information densities, but are still achievable rate functions). In this case the $\theta$ achieving the maximization in the numerator would not necessarily yield the "correct" marginal $P_\theta(\mathbf{x}) = Q(\mathbf{x})$.

To summarize, we have seen that attaining the ML-based rate function (73) is advantageous. In the sequel we analyze the intrinsic redundancy associated with this rate function, and show how it can be achieved adaptively in many cases of interest. However we must note that there is a gap between the justification for this rate function, and what attaining it actually yields. In justifying this rate function we have analyzed the behavior in the case that the relation between $\mathbf{x}$ and $\mathbf{y}$ is governed by a probability law from a given class; however, the system attaining $R_{\mathrm{emp}}$ of (73) will not only guarantee this behavior but also guarantee a certain rate and error probability for each pair of sequences (which is more than required to obtain the target of achieving the mutual information rate for all channels in the class, using feedback). Therefore we should not treat this system as the best system attaining the mutual information rate, but rather as a system attaining the $R_{\mathrm{emp}}$ of (73) per each pair of sequences, where this $R_{\mathrm{emp}}$ on one hand guarantees a certain behavior when $\mathbf{x},\mathbf{y}$ are governed by a probability law from the class, but also guarantees some computable rate when a different probability law is applied. This may be compared against a system which attempts to learn by measuring the channel, and may also attain the mutual information rate, but does not give any guarantee on what occurs when another probability law is applied.

2) Intrinsic redundancy: For finite classes, it is easy to bound the intrinsic redundancy of (73). Since the intrinsic redundancy of $\frac{1}{n}\log\frac{P_\theta(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})}$ is non-positive (see Section V-D), according to Property 2 of the intrinsic redundancy (Section IV-A), the intrinsic redundancy of $R_{\mathrm{emp}}^{\mathrm{ML}}$ is at most $\frac{\log|\Theta|}{n}$. Therefore we may allow the size of the class to increase with $n$, and as long as this increase is sub-exponential, the intrinsic redundancy $\mu_Q(R_{\mathrm{emp}}^{\mathrm{ML}})$ would tend to 0 with $n$, and therefore $R_{\mathrm{emp}}^{\mathrm{ML}}$ of (73) would be asymptotically achievable. However, as we shall see, (73) may be asymptotically achievable even for infinite parametric classes as long as suitable smoothness conditions hold.

The size of the model class yields a coarse estimate for the intrinsic redundancy of (73). A finer analysis is obtained by relating the intrinsic redundancy to the regret of a universal distribution representing the model class $\{P_\theta(\mathbf{x}|\mathbf{y})\}$. In universal source coding of a family of sources with distributions $P_\theta(\mathbf{x})$, one seeks a single distribution $P(\mathbf{x})$ which approximates all distributions in the class, up to a certain loss $\mathcal{R}(\theta,\mathbf{x},P) = \log\frac{P_\theta(\mathbf{x})}{P(\mathbf{x})}$, termed the "regret", which represents the difference in encoding lengths when $P$ is used, compared to when $P_\theta(\mathbf{x})$ is used [7]. The minimax regret $\mathcal{R}_{\mathrm{minimax}} = \min_P\max_{\theta,\mathbf{x}}\mathcal{R}(\theta,\mathbf{x},P)$ is the minimum value of the worst case regret over all models $\theta$ and sequences $\mathbf{x}$.
It is easy to show [7] that the distribution $P$ which achieves the minimax regret is

$$P_{\mathrm{NML}}(\mathbf{x}) = \frac{\max_\theta P_\theta(\mathbf{x})}{\sum_{\mathbf{x}}\max_\theta P_\theta(\mathbf{x})} = \frac{\hat{p}_{\mathrm{ML}}(\mathbf{x})}{\sum_{\mathbf{x}}\hat{p}_{\mathrm{ML}}(\mathbf{x})} \tag{74}$$

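This distribution is simply a normalization of the super-probability $\hat{p}_{\mathrm{ML}}(\mathbf{x})$ (which we would like to approximate by a probability), and is termed "Normalized Maximum Likelihood" (NML). The regret is determined by the size of the normalization factor

$$\log\frac{\hat{p}_{\mathrm{ML}}(\mathbf{x})}{P_{\mathrm{NML}}(\mathbf{x})} = \log C_{\mathrm{NML}} \tag{75}$$

where

$$C_{\mathrm{NML}} = \sum_{\mathbf{x}}\hat{p}_{\mathrm{ML}}(\mathbf{x}) \tag{76}$$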

The fact that $P_{\mathrm{NML}}$ is minimax optimal is evident by observing that $P_{\mathrm{NML}}$ is required to be the closest probability that approximates the super-probability $\hat{p}_{\mathrm{ML}}$ (in a logarithmic minimax regret sense), and a normalization by a constant factor, which yields a constant regret, is best, since decreasing the factor at any point would necessarily require increasing it at other points, thus increasing the maximum regret. The resulting regret was analyzed by Barron, Rissanen, Yu and others and is known up to negligible terms in many cases of interest. For continuous parametric families, where $\theta$ is a vector of size $k$, it was shown by Rissanen [13, Theorem 1] that under certain conditions, there exists $P$ having the following regret, determined up to a vanishing factor:

$$\forall\mathbf{x}:\ \max_\theta\mathcal{R}(\theta,\mathbf{x},P) = \log\frac{\hat{p}_{\mathrm{ML}}(\mathbf{x})}{P(\mathbf{x})} = \frac{k}{2}\log\frac{n}{2\pi} + \log\int_{\Theta}\sqrt{|I(\theta)|}\,d\theta + o_n(1) \tag{77}$$

where $I(\theta) = \lim_{n\to\infty}\frac{1}{n}E\left[-\frac{\partial^2}{\partial\theta\,\partial\theta^T}\ln P_\theta(\mathbf{x})\right]$ is the limit of the normalized Fisher information matrix. Since this value does not grow with $n$, the main factor in the regret is $\frac{k}{2}\log n$, which is the penalty associated with the "richness" of the class. Rissanen's conditions are sometimes limiting. As an example, they do not hold for the class of memoryless sources where $\theta$ is the vector of letter probabilities, at the boundary of $\Theta$, i.e. when one of the elements of $\theta$ is 0 or 1, since the Fisher information is infinite at these points. One solution is to apply the result only to the interior of $\Theta$ and account for the boundaries separately. However, specifically for the class of memoryless sources, there are explicit expressions for the regret, with the same behavior as determined by (77). See Section VII-F2 in the following for a more detailed discussion of the memoryless and conditional cases. A conclusion from (77) is that the minimax redundancy of the NML, which is optimal, satisfies

$$\max_{\theta,\mathbf{x}}\mathcal{R}(\theta,\mathbf{x},P_{\mathrm{NML}}) = \log C_{\mathrm{NML}} \le \frac{k}{2}\log\frac{n}{2\pi} + \log\int_{\Theta}\sqrt{|I(\theta)|}\,d\theta + o_n(1) \tag{78}$$
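
For the binary memoryless class the NML normalization factor can be computed exactly and compared with the asymptotic expression. The sketch below is an illustration under stated assumptions: it takes $k = 1$ and uses the standard Bernoulli Fisher information $I(\theta) = \frac{1}{\theta(1-\theta)}$, for which $\int_0^1\sqrt{I(\theta)}\,d\theta = \pi$; the function name and the choice $n = 1000$ are hypothetical. As noted above, the regularity conditions fail at the boundary of the parameter set, so the comparison is meant only as a numerical sanity check.

```python
import math

def log_nml_normalizer_bernoulli(n):
    """log C_NML for the i.i.d. binary class.

    C_NML is the sum over all binary sequences of max_theta p_theta(x),
    grouped by the number k of ones: C(n, k) * (k/n)^k * ((n-k)/n)^(n-k).
    """
    total = 0.0
    for k in range(n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(k + 1)
                    - math.lgamma(n - k + 1))
        if k > 0:                       # 0 * log(0) is taken as 0
            log_term += k * math.log(k / n)
        if k < n:
            log_term += (n - k) * math.log((n - k) / n)
        total += math.exp(log_term)
    return math.log(total)

n = 1000
exact = log_nml_normalizer_bernoulli(n)
# (k/2) log(n / 2 pi) + log(integral of sqrt(Fisher information)), with k = 1.
asymptotic = 0.5 * math.log(n / (2 * math.pi)) + math.log(math.pi)
```

The two values differ by a term that shrinks with $n$, consistent with the leading $\frac{k}{2}\log n$ behavior of the regret.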

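Returning to our problem we begin with a general analysis of the intrinsic redundancy of $R_{\mathrm{emp}}^{\mathrm{ML}}$ assuming that the conditions for (77) hold. For each $\mathbf{y}$ separately, we form a distribution $P^*(\mathbf{x}|\mathbf{y})$ on $\mathbf{x}$ which has a bounded regret with respect to the maximum likelihood probability $\hat{p}_{\mathrm{ML}}(\mathbf{x}|\mathbf{y})$ (one option is the NML). By (77) we have that

$$\forall\mathbf{x},\mathbf{y}:\ \log\frac{\hat{p}_{\mathrm{ML}}(\mathbf{x}|\mathbf{y})}{P^*(\mathbf{x}|\mathbf{y})} \le \frac{k}{2}\log\frac{n}{2\pi} + \log\int_{\Theta}\sqrt{|I(\theta)|}\,d\theta + o_n(1) = \frac{k}{2}\log n + O_n(1) \tag{79}$$

where here the asymptotical Fisher information matrix $I$ may, in general, depend on $\mathbf{y}$. Now writing

$$R_{\mathrm{emp}}^{\mathrm{ML}} = \frac{1}{n}\log\frac{\hat{p}_{\mathrm{ML}}(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})} = \frac{1}{n}\log\frac{P^*(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})} + \frac{1}{n}\log\frac{\hat{p}_{\mathrm{ML}}(\mathbf{x}|\mathbf{y})}{P^*(\mathbf{x}|\mathbf{y})} \tag{80}$$

Since $P^*$ is a probability distribution, the first term has a non-positive intrinsic redundancy (Lemma 3), and the second term is bounded by (79); therefore by the additivity of intrinsic redundancy, $R_{\mathrm{emp}}^{\mathrm{ML}}$ has intrinsic redundancy of $\mu_Q(R_{\mathrm{emp}}^{\mathrm{ML}}) \le \frac{k}{2}\cdot\frac{\log n}{n} + O_n(1/n)$.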

Note that although the intrinsic redundancy obtained here has a similar form to the minimax regret in universal source coding, the number of parameters $k$ will be in most cases larger due to the conditioning on $\mathbf{y}$. As an example, to model all i.i.d. sources over alphabet $\mathcal{X}$ one needs $|\mathcal{X}| - 1$ parameters to define the letter distribution ($|\mathcal{X}|$ letter probabilities, and a constraint on the sum). To model all memoryless distributions $P(\mathbf{x}|\mathbf{y}) = \prod_{i=1}^{n}P(x_i|y_i)$, one needs $|\mathcal{X}| - 1$ parameters for each value of $y_i$; therefore $k = (|\mathcal{X}| - 1)\cdot|\mathcal{Y}|$ parameters.

3) Universality over a set of probabilistic non-ergodic channels: . 
C. Variations on the maximum likelihood construction

1) The doubly maximum likelihood construction: In the maximum-likelihood construction proposed above (73) the rate function depends on the prior $Q$. It is sometimes convenient to avoid the specific dependence on $Q$ by replacing $Q(\mathbf{x})$ with the maximum-likelihood probability $\hat{p}_{\mathrm{ML}}(\mathbf{x})$ of the sequence $\mathbf{x}$.

Since we assumed there exists $\theta$ such that $P_\theta(\mathbf{x}) = Q(\mathbf{x})$, we have $\hat{p}_{\mathrm{ML}}(\mathbf{x}) = \max_\theta P_\theta(\mathbf{x}) \ge Q(\mathbf{x})$, therefore we have:

$$R_{\mathrm{emp}}^{\mathrm{ML}*} \equiv \frac{1}{n}\log\frac{\hat{p}_{\mathrm{ML}}(\mathbf{x}|\mathbf{y})}{\hat{p}_{\mathrm{ML}}(\mathbf{x})} \le \frac{1}{n}\log\frac{\hat{p}_{\mathrm{ML}}(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})} = R_{\mathrm{emp}}^{\mathrm{ML}} \tag{81}$$

Therefore if $R_{\mathrm{emp}}^{\mathrm{ML}}$ is achievable (in any of the senses), $R_{\mathrm{emp}}^{\mathrm{ML}*}$ is achievable as well. $R_{\mathrm{emp}}^{\mathrm{ML}*}$ is sometimes more convenient to use since it does not include the prior $Q$ in an explicit form, and may be suitable for a large class of priors. The empirical mutual information as well as other rate functions presented in [1], [3] are of this form. In the examples in Section VIII we usually use the form $R_{\mathrm{emp}}^{\mathrm{ML}}$ for analysis and present the two forms $R_{\mathrm{emp}}^{\mathrm{ML}}$, $R_{\mathrm{emp}}^{\mathrm{ML}*}$ for each case. It can be observed from the Table that $R_{\mathrm{emp}}^{\mathrm{ML}*}$ has a more intuitively appealing form. The rate functions of this form are inherently sub-optimal, since they are in general uniformly inferior with respect to the respective $R_{\mathrm{emp}}^{\mathrm{ML}}$, but this sub-optimality is insignificant since it is expressed only when the maximum likelihood probability significantly differs from the actual one. In most cases, if $\mathbf{x}$ is a typical sequence, then the maximum likelihood estimate will be close to the true value, and the empirical probability will be close to the true one, and therefore the difference is insignificant for typical sequences. As we argue in Section VI-E2, the main interest should be in the values of the rate function for typical $\mathbf{x}$, therefore in many cases the difference between $R_{\mathrm{emp}}^{\mathrm{ML}*}$ and $R_{\mathrm{emp}}^{\mathrm{ML}}$ is immaterial.

2) The use of universal distributions: As we have seen, the maximum likelihood probability, after being normalized, yields the NML probability measure which is close to any distribution in the family. In general, one may define other such "universal distributions" based on similar or different criteria, and define the rate function as:

$$R_{\mathrm{emp}}^{U} = \frac{1}{n}\log\frac{P_U(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})} \tag{82}$$

where $P_U$ is a universal conditional probability. Similarly to the previous section, $Q(\mathbf{x})$ may be replaced by a "close" universal distribution $P_U(\mathbf{x})$; however, since in this case we do not have the inequality $P_U(\mathbf{x}) \ge Q(\mathbf{x})$, a bound on $\frac{Q(\mathbf{x})}{P_U(\mathbf{x})}$ may be required to show the modified rate function is achievable.

D. Entropy based notation for maximum likelihood rate functions 



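It is intuitively appealing to write $R_{\mathrm{emp}}^{\mathrm{ML}}$ and $R_{\mathrm{emp}}^{\mathrm{ML}*}$ as a difference of entropies. Using the definitions from Section VI-A

$$\hat{H}_{\mathrm{ML}}(\mathbf{x}|\mathbf{y}) = -\frac{1}{n}\log\hat{p}_{\mathrm{ML}}(\mathbf{x}|\mathbf{y})$$

$$\hat{H}_{\mathrm{ML}}(\mathbf{x}) = -\frac{1}{n}\log\hat{p}_{\mathrm{ML}}(\mathbf{x})$$

$$\hat{H}_{Q}(\mathbf{x}) = -\frac{1}{n}\log Q(\mathbf{x})$$

we have in analogy to $I(X;Y) = H(X) - H(X|Y)$:

$$R_{\mathrm{emp}}^{\mathrm{ML}} = \hat{H}_{Q}(\mathbf{x}) - \hat{H}_{\mathrm{ML}}(\mathbf{x}|\mathbf{y}) \tag{83}$$

$$R_{\mathrm{emp}}^{\mathrm{ML}*} = \hat{H}_{\mathrm{ML}}(\mathbf{x}) - \hat{H}_{\mathrm{ML}}(\mathbf{x}|\mathbf{y}) \tag{84}$$

In many cases the empirical entropies above have an intuitive interpretation as a measure for the complexity of the vectors. When the equality in (67) holds, we can write the rate function $R_{\mathrm{emp}}^{\mathrm{ML}*}$ in a symmetric form:

$$R_{\mathrm{emp}}^{\mathrm{ML}*} = \frac{1}{n}\log\frac{\hat{p}_{\mathrm{ML}}(\mathbf{x},\mathbf{y})}{\hat{p}_{\mathrm{ML}}(\mathbf{x})\,\hat{p}_{\mathrm{ML}}(\mathbf{y})} = \hat{H}_{\mathrm{ML}}(\mathbf{x}) + \hat{H}_{\mathrm{ML}}(\mathbf{y}) - \hat{H}_{\mathrm{ML}}(\mathbf{x},\mathbf{y}) \tag{85}$$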

E. Rate functions defined by given empirical parameters 

Now we examine a different construction for rate functions, relying on a parametric representation of the input and output sequences. For example, in [1, Lemma 3] we have justified the rate function $\frac{1}{2}\log\frac{1}{1-\hat{\rho}^2}$ for the continuous real-valued channel, as the best rate function defined by second order statistics, in a compound channel setting. More generally, suppose that we decide on a certain empirical parametrization of the sequences $\mathbf{x},\mathbf{y}$ (e.g. zero order empirical statistics, empirical second order moments, etc.); can we find the "best" rate function that can be defined using this parametrization?
Let $\hat{\theta}(\mathbf{x},\mathbf{y}) \in \Theta$ be a predefined estimator of a parameter vector $\theta \in \Theta$, and let $Q$ be a predetermined prior. We limit our scope to rate functions defined as:

$$R_{\mathrm{emp}} = R(\hat{\theta}(\mathbf{x},\mathbf{y})) \tag{86}$$

where $R(\theta)$ is a function of our choice. Given $\hat{\theta}$, $Q$ we would like to find the maximum $R(\theta)$ for which $R_{\mathrm{emp}}$ would be achievable.

1) Optimal rate functions over types: An alternative formulation of the problem is to say that the set of sequences $(\mathbf{x},\mathbf{y}) \in \mathcal{X}^n \times \mathcal{Y}^n$ is separated into disjoint sets, termed types, $T_{xy} \subset (\mathcal{X}^n \times \mathcal{Y}^n)$, and the rate function is required to be a function of the type, $R_{emp} = R(T_{xy})$. This formulation is equivalent to the former, since we may define the types as the sets of sequences that yield the same value of the parameter, i.e. $T_{xy}(\theta) = \{(\mathbf{x},\mathbf{y}) : \hat\theta(\mathbf{x},\mathbf{y}) = \theta\}$. However, we now further constrain ourselves to the case where the number of types is finite (equivalently, the set of possible parameter values is finite). This assumption is more suitable to the discrete case, since when the sequences x, y are discrete, the number of possible parameter values for a certain block length n is finite.

As an example, suppose that the parametrization is by the zero order empirical statistics. In this case $\hat\theta$ is a vector comprised of the $|\mathcal{X}| \cdot |\mathcal{Y}|$ elements of the empirical probability $\hat P_{x,y}(x,y)$. Since each element of the empirical probability is in the set $\{\frac{i}{n}\}_{i=0}^{n}$, there are at most $N_T \le (n+1)^{|\mathcal{X}||\mathcal{Y}|}$ values. Alternatively, the types defined by the sets of sequences with the same value of the parameter, i.e. the empirical distribution, are in this case the regular types defined by Csiszár [8][14, Chapter 11], and the number of types is bounded by $N_T$ above. The concept of types was generalized in various ways [8, Sec. VIII][15]. However, currently we do not assume anything about the structure of the type classes, and they can be arbitrary sets of pairs (x, y). Our only assumption is that the number of type classes is finite and upper bounded by a given value, denoted $N_T$.
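
The counting bound on the number of joint types can be checked by brute force for a small block length. The following Python sketch is only an illustration: the helper names `joint_type` and `count_joint_types`, the binary alphabets and the block length n = 4 are assumptions chosen for the example.

```python
from collections import Counter
from itertools import product


def joint_type(x, y):
    """Return the joint type of (x, y) as a sorted tuple of ((a, b), count)."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return tuple(sorted(Counter(zip(x, y)).items()))


def count_joint_types(n, x_alphabet, y_alphabet):
    """Enumerate all pairs of length-n sequences and count distinct joint types."""
    types = set()
    for x in product(x_alphabet, repeat=n):
        for y in product(y_alphabet, repeat=n):
            types.add(joint_type(x, y))
    return len(types)


n = 4
num_types = count_joint_types(n, (0, 1), (0, 1))
# Coarse bound (n + 1)^(|X| * |Y|) with |X| = |Y| = 2.
bound = (n + 1) ** (2 * 2)
print(num_types, bound)
```

For these small parameters the exact count is far below the coarse polynomial bound, which is all the argument needs: the number of types must grow subexponentially in n.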

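We begin with an upper bound on the rate function. We denote by $T_{xy}(\mathbf{x},\mathbf{y})$ the type class associated with a specific pair of sequences. Consider a specific type $T^0_{xy}$. If $(\mathbf{x},\mathbf{y}) \in T^0_{xy}$ (i.e. $T_{xy}(\mathbf{x},\mathbf{y}) = T^0_{xy}$), then $R_{emp}(\mathbf{x},\mathbf{y}) = R(T_{xy}(\mathbf{x},\mathbf{y})) = R(T^0_{xy})$, therefore for a specific y,

$$\Pr\{R_{emp}(\mathbf{X},\mathbf{y}) \ge R(T^0_{xy})\} \ge \Pr\{(\mathbf{X},\mathbf{y}) \in T^0_{xy}\} = Q\{\mathbf{x} : (\mathbf{x},\mathbf{y}) \in T^0_{xy}\} = Q\{T^0_{x|y}(\mathbf{y})\} \tag{87}$$

where we have defined the conditional type $T^0_{x|y}(\mathbf{y}) = \{\mathbf{x} : (\mathbf{x},\mathbf{y}) \in T^0_{xy}\}$ (in analogy to the regular definition of conditional types [8, Lemma II.3]). On the other hand, by Theorem 1, if $R_{emp}$ is achievable then

$$\Pr\{R_{emp}(\mathbf{X},\mathbf{y}) \ge R(T^0_{xy})\} \le (1-\epsilon)^{-1}\exp(-nR(T^0_{xy})) \tag{88}$$

Combining the two inequalities (87), (88) we have that

$$\forall \mathbf{y}: \quad Q\{T^0_{x|y}(\mathbf{y})\} \le (1-\epsilon)^{-1}\exp(-nR(T^0_{xy})) \tag{89}$$

i.e.

$$R(T^0_{xy}) \le -\frac{1}{n}\sup_{\mathbf{y}}\log Q\{T^0_{x|y}(\mathbf{y})\} + \frac{1}{n}\log\frac{1}{1-\epsilon} \tag{90}$$

For large n, the second term in the RHS of (90) tends to 0 and the first term is therefore the dominant one. Note that for vectors y that do not appear in $T^0_{xy}$, $T^0_{x|y}(\mathbf{y})$ is an empty set, and therefore these do not affect the supremum and can be removed.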

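We now show that the first term in the RHS of (90) indeed leads to an achievable rate function if the number of types is not too large. Let the rate function be defined as:

$$R(T_{xy}) = -\frac{1}{n}\sup_{\mathbf{y}}\log Q\{T_{x|y}(\mathbf{y})\} - \delta \tag{91}$$

From (91) we have for any y:

$$Q\{T_{x|y}(\mathbf{y})\} \le \exp[-n(R(T_{xy}) + \delta)] \tag{92}$$

For any y and $R \in \mathbb{R}$:

$$\begin{aligned}
\Pr\{R_{emp}(\mathbf{X},\mathbf{y}) \ge R\} &= \Pr\{R(T_{xy}(\mathbf{X},\mathbf{y})) \ge R\} \\
&= \sum_{T^0_{xy}} \Pr\{(R(T_{xy}(\mathbf{X},\mathbf{y})) \ge R) \cap (T_{xy}(\mathbf{X},\mathbf{y}) = T^0_{xy})\} \\
&= \sum_{T^0_{xy}:\, R(T^0_{xy}) \ge R} \Pr\{(\mathbf{X},\mathbf{y}) \in T^0_{xy}\} = \sum_{T^0_{xy}:\, R(T^0_{xy}) \ge R} Q\{T^0_{x|y}(\mathbf{y})\} \\
&\le \sum_{T^0_{xy}:\, R(T^0_{xy}) \ge R} \exp[-n(R(T^0_{xy}) + \delta)] \le \sum_{T^0_{xy}:\, R(T^0_{xy}) \ge R} \exp[-n(R + \delta)] \\
&\le N_T \cdot \exp(-n\delta) \cdot \exp(-nR)
\end{aligned} \tag{93}$$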
Fig. 5. An illustration of the calculation of type-based rate function by Theorem 6. Horizontal axis: input sequences $\mathcal{X}^n$. Labels: (a) the specific pair (x, y); (b) the type class of (x, y); (c) the set of all y in the class (projection); (d) the largest probability conditional type $T_{x|y}(\mathbf{y})$; (e) the specific y yielding the largest probability conditional type $T_{x|y}(\mathbf{y})$.



Therefore, in order to satisfy the sufficient condition of Theorem 1 it is sufficient to require $N_T \cdot \exp(-n\delta) \le \epsilon$, i.e. $\delta = \frac{1}{n}\log\frac{N_T}{\epsilon}$.
We summarize these results in the following theorem.

Theorem 6. Let $\mathcal{T}_{xy}$ denote a set of no more than $N_T$ disjoint sets (types), $|\mathcal{T}_{xy}| \le N_T$, covering the set of sequences $\mathcal{X}^n \times \mathcal{Y}^n$. For two sequences (x, y), let $T_{xy} \in \mathcal{T}_{xy}$ denote the type containing these sequences, and let $T_{x|y}(\mathbf{y}) = \{\mathbf{x} : (\mathbf{x},\mathbf{y}) \in T_{xy}\}$ denote the respective conditional type. For a prior Q on $\mathcal{X}^n$ define the following rate function:

$$R_{emp}(\mathbf{x},\mathbf{y}) = -\frac{1}{n}\sup_{\mathbf{y}}\log Q\{T_{x|y}(\mathbf{y})\} \tag{94}$$

Then for a prior Q and an error probability $\epsilon$:

1) Any achievable rate function which can be written as a function of the joint type $T_{xy}$ of the sequences x, y (i.e. the set $T_{xy}$ such that $(\mathbf{x},\mathbf{y}) \in T_{xy}$) cannot exceed $R_{emp}$ by more than $\frac{1}{n}\log\frac{1}{1-\epsilon}$.

2) $R_{emp}$ is achievable up to $\delta = \frac{1}{n}\log\frac{N_T}{\epsilon}$.

3) Furthermore, if the number of types increases subexponentially with n, i.e. $\frac{1}{n}\log N_T \to 0$, then $R_{emp}$ is asymptotically achievable.

Proof: The proof is given by the derivation above (the first claim in (90) and the second in (93)). The last claim results trivially from the second, since under the assumption, $\delta \to 0$. □

The calculation of the $R_{emp}$ proposed above (94) is illustrated in Figure 5. The axes lines denote the set of all sequences $\mathbf{x} \in \mathcal{X}^n$ (horizontal) and $\mathbf{y} \in \mathcal{Y}^n$ (vertical). For the specific pair (x, y) for which the rate function is computed (a), the polygon (b) depicts the type class this pair belongs to (any arbitrary sub-group of pairs). All y in the type class (c) are scanned, to find the one (e) yielding the maximum-probability conditional type (d), illustrated by the maximum horizontal width in the figure.

The main gap between the upper bound and the lower bound of Theorem 6 is due to $N_T$, the number of types. This gap is essentially unavoidable (when types are considered as general sets), since it is possible to construct a rate function that will nearly meet the necessary condition for one type (up to the gaps resulting from Theorem 1), by placing all the probability on that type (i.e. having $R_{emp} = 0$ for all other types). In this case the bound of (87) becomes tight for this type.
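
The scan described above can be mimicked by brute force when n is very small. The sketch below is a hedged illustration, not the paper's procedure: it assumes the types are the standard joint types, an i.i.d. prior given as a dictionary `q`, natural logarithms, and hypothetical helper names (`joint_type`, `seq_prob`, `type_rate_function`).

```python
import math
from collections import Counter
from itertools import product


def joint_type(x, y):
    """Joint type of (x, y) as a sorted tuple of ((a, b), count)."""
    return tuple(sorted(Counter(zip(x, y)).items()))


def seq_prob(x, q):
    """Probability of the sequence x under the i.i.d. prior q."""
    return math.prod(q[a] for a in x)


def type_rate_function(x, y, q, x_alphabet, y_alphabet):
    """Evaluate -(1/n) max_y' log Q{T_{x|y}(y')} for the type class of (x, y).

    Every candidate output y' is scanned; for each one the prior mass of the
    conditional type {x' : (x', y') in the type class} is accumulated.
    """
    n = len(x)
    target = joint_type(x, y)
    best = 0.0
    for y_cand in product(y_alphabet, repeat=n):
        mass = sum(seq_prob(x_cand, q)
                   for x_cand in product(x_alphabet, repeat=n)
                   if joint_type(x_cand, y_cand) == target)
        best = max(best, mass)
    # best > 0 because (x, y) itself belongs to the type class.
    return -math.log(best) / n


q = {0: 0.5, 1: 0.5}
rate = type_rate_function((0, 0, 0, 1), (0, 0, 1, 1), q, (0, 1), (0, 1))
print(rate)
```

In this toy case the largest conditional type contains two of the sixteen equiprobable input sequences, so the rate is $\frac{1}{4}\ln 8$ nats per symbol.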

2) On the optimality of the empirical mutual information: We now particularize the result of Theorem 6 for the memoryless model, in which $\hat\theta$ is the joint empirical distribution, and the types $T_{xy}$ are the standard types [8]. We use the coarse upper bound $N_T = (n+1)^{|\mathcal{X}||\mathcal{Y}|}$ [14, Theorem 11.1.1] (see also Section VI-E1). We assume that $Q(\mathbf{x})$ is also memoryless, i.e. $Q(\mathbf{x}) = \prod_{i=1}^{n} Q(x_i)$.

Consider two sequences (x, y) having an empirical distribution $\hat P_{x,y}$ and belonging to the type class $T_{xy}$. For notational purposes, we denote by X, Y dummy random variables, distributed according to $\hat P_{x,y}(x,y)$.

The size of the conditional type is $|T_{x|y}(\mathbf{y})| = c_n\exp(nH(X|Y))$ for any $\mathbf{y} \in T_y$, where $c_n$ is a subexponential factor ($\frac{\log c_n}{n} \to 0$) [8, Lemma II.3] (for other y-s it is zero). All sequences in the conditional type are of the same type $T_x$, and therefore have the same probability under Q, which is easily shown to equal $Q^n(\mathbf{x}) = \exp[-n(H(X) + D(\hat P_x\|Q))]$ [8, (II.1)]. Therefore we have for all $\mathbf{y} \in T_y$:

$$\begin{aligned}
Q\{T_{x|y}(\mathbf{y})\} &= |T_{x|y}(\mathbf{y})| \cdot Q^n(\mathbf{x}) \\
&= c_n\exp(nH(X|Y))\exp[-n(H(X) + D(\hat P_x\|Q))] \\
&= c_n\exp[-n(H(X) - H(X|Y) + D(\hat P_x\|Q))] \\
&= c_n\exp[-n(I(X;Y) + D(\hat P_x\|Q))] \\
&= c_n\exp[-n(\hat I(\mathbf{x};\mathbf{y}) + D(\hat P_x\|Q))]
\end{aligned} \tag{95}$$
Hence, the rate function defined by Theorem 6 in our case is:

$$R_{emp}(\mathbf{x},\mathbf{y}) = -\frac{1}{n}\sup_{\mathbf{y}}\log Q\{T_{x|y}(\mathbf{y})\} = \hat I(\mathbf{x};\mathbf{y}) + D(\hat P_x\|Q) - \frac{\log c_n}{n} \tag{96}$$

where $\frac{\log c_n}{n}$ is asymptotically vanishing. According to Theorem 6 this is the optimal rate function defined using types, up to asymptotically vanishing factors. Since $\frac{\log N_T}{n}$ is also asymptotically vanishing, the conclusion is:

Lemma 5. The following rate function:

$$R_{emp}(\mathbf{x},\mathbf{y}) = \hat I(\mathbf{x};\mathbf{y}) + D(\hat P_x\|Q) \tag{97}$$

is the maximum rate function defined by zero-order statistics (equivalently, joint types) which is asymptotically achievable.

Note that we have used the term "maximum rate function asymptotically achievable" in a somewhat loose way. What it actually means is that $R_{emp}(\mathbf{x},\mathbf{y})$ can only be improved by asymptotically vanishing factors.

This result shows that formally, perhaps contrary to intuition, the empirical mutual information is not the asymptotically optimal rate function defined by zero order statistics: the above rate function is uniformly better.

However, considering this from another perspective, we may argue that this difference is immaterial. The rate function above (97) significantly exceeds the empirical mutual information, due to its second term, only when x is non-typical, i.e. when the empirical distribution of x significantly differs from the prior Q. Since x is fully controlled by the encoder and has a known probability distribution (as opposed to y), increasing the rate for non-typical x does not give any actual gain, since we know in advance these events are rare [8, Theorem III.3], irrespective of the channel behavior. In other words, rate functions should be compared mainly based on their values for the typical set of x sequences. Considering this perspective, we may interpret the result above as essentially proving the optimality of the empirical mutual information, as it aligns with the above rate function for the typical x. For any rate function asymptotically improving over $\hat I(\mathbf{x};\mathbf{y})$ (and still bounded by (97)), the improvement may happen only for non-typical (and thus, low probability) x. Furthermore, it is impossible to have a non-vanishing gain over $\hat I(\mathbf{x};\mathbf{y})$ for all sequences, since this would imply improving over (97) for sequences with $\hat P_x = Q$. Therefore we may conclude that the empirical mutual information is "effectively" optimal.

The fact that, strictly speaking, the empirical mutual information is not optimal is not surprising if one recalls (70) that $\hat I(\mathbf{x};\mathbf{y}) = \frac{1}{n}\log\frac{\hat p(\mathbf{x}|\mathbf{y})}{\hat p(\mathbf{x})}$, and therefore it is of the suboptimal form $R^{ML*}_{emp}$ defined in Section VI-C. Indeed, replacing $\hat p(\mathbf{x})$ by $Q(\mathbf{x})$ we obtain a rate function of the maximum likelihood form (73) which equals the asymptotically optimal function presented above (97):

$$\begin{aligned}
\frac{1}{n}\log\frac{\hat p(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})} &= \frac{1}{n}\log\frac{\hat p(\mathbf{x}|\mathbf{y})}{\hat p(\mathbf{x})} + \frac{1}{n}\log\frac{\hat p(\mathbf{x})}{Q(\mathbf{x})} = \hat I(\mathbf{x};\mathbf{y}) + \frac{1}{n}\sum_{i=1}^{n}\log\frac{\hat P_x(x_i)}{Q(x_i)} \\
&= \hat I(\mathbf{x};\mathbf{y}) + \sum_{x}\hat P_x(x)\log\frac{\hat P_x(x)}{Q(x)} = \hat I(\mathbf{x};\mathbf{y}) + D(\hat P_x\|Q)
\end{aligned} \tag{98}$$

This observation strengthens the motivation for the maximum likelihood construction (73), as we have now seen that in addition to the properties mentioned in Section VI-B this construction yields an asymptotically optimal rate function in the memoryless case.
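
The identity between the likelihood-ratio form and the sum of empirical mutual information and divergence can be verified numerically on short sequences. The following Python sketch is an illustration under stated assumptions: the function names are hypothetical, logarithms are natural, the ML model is the memoryless empirical conditional distribution, and the prior `q` must be positive on every symbol that appears in x.

```python
import math
from collections import Counter


def empirical_mi_plus_divergence(x, y, q):
    """Empirical mutual information plus D(P_hat_x || Q), in nats."""
    n = len(x)
    pxy, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    mi = sum((c / n) * math.log(c * n / (px[a] * py[b]))
             for (a, b), c in pxy.items())
    div = sum((c / n) * math.log((c / n) / q[a]) for a, c in px.items())
    return mi + div


def ml_likelihood_ratio_rate(x, y, q):
    """(1/n) log( p_hat(x|y) / Q(x) ) with memoryless p_hat and Q, in nats."""
    n = len(x)
    pxy, py = Counter(zip(x, y)), Counter(y)
    return sum(math.log(pxy[(a, b)] / py[b]) - math.log(q[a])
               for a, b in zip(x, y)) / n


x, y = (0, 0, 0, 1), (0, 0, 1, 1)
q = {0: 0.7, 1: 0.3}
lhs = ml_likelihood_ratio_rate(x, y, q)
rhs = empirical_mi_plus_divergence(x, y, q)
print(lhs, rhs)

# A non-typical input: constant x carries no information about y, yet the
# divergence term alone gives a positive value under a uniform prior.
uniform = {0: 0.5, 1: 0.5}
atypical = ml_likelihood_ratio_rate((0, 0, 0, 0), (0, 1, 1, 0), uniform)
print(atypical)
```

The second call illustrates the earlier remark about non-typical inputs: the whole value comes from the divergence term, irrespective of the output sequence.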

A way to understand the reason that $\frac{1}{n}\log\frac{\hat p(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})}$ is optimal is as follows: since we are looking for an asymptotically achievable form we consider only rate functions of the form $R_{emp} = \frac{1}{n}\log\frac{P(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})}$ (the asymptotically limiting form, by Theorem 5). Further constraining the rate function to be a function of the empirical statistics brings us to consider only memoryless P and Q (note that this is not a necessary condition!), i.e. we have

$$R_{emp} = \frac{1}{n}\sum_{i=1}^{n}\log\frac{P(x_i|y_i)}{Q(x_i)} = \sum_{x,y}\hat P_{x|y}(x|y)\hat P_y(y)\log\frac{P(x|y)}{Q(x)}. \tag{99}$$

This leaves us with the problem of choosing P. Since for every specific sequences x, y and every letter y, $\sum_x \hat P_{x|y}(x|y)\log P(x|y) \le \sum_x \hat P_{x|y}(x|y)\log\hat P_{x|y}(x|y)$, $R_{emp}$ is upper bounded by $\sum_{x,y}\hat P_{x|y}(x|y)\hat P_y(y)\log\frac{\hat P_{x|y}(x|y)}{Q(x)} = \frac{1}{n}\log\frac{\hat p(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})}$, and on the other hand, as we have seen, this rate function is achievable with an asymptotically vanishing redundancy.

F. The rate of a given decoding metric 

As already mentioned, every single user communication system can be characterized by a rate function: one could always "freeze" the channel and observe how the system performed (in terms of rate and error probability) over all instances in which a specific x was the input and a specific y was the output. Having characterized the system in this way, we may now consider how it operates over any channel of interest. In the particular case of random encoders and metric based decoders, explicit expressions for a rate function from a given metric and an input distribution can be given. Then, these expressions can be used in order to compete against a class of systems, defined by different decoding metrics.

We now consider the specific case of a random i.i.d. code and metric based decoder. This class of systems was selected since it allows a relatively simple analysis. On the other hand, this class of systems is able to attain the information theoretic bounds with respect to rate and error exponent (where the latter is known to be tight over part of its domain [8]). This class was used as a comparison class by Ziv [16], and the current derivation is inspired by the analysis performed there.

The code is a random i.i.d. selection of $M = \exp(nR)$ codewords from a predetermined distribution $Q(\mathbf{x})$. The decoder uses a decoding metric $u(\mathbf{x},\mathbf{y})$, and after seeing y, chooses the word with the highest value of $u(\mathbf{x},\mathbf{y})$. Note that the system attaining the sufficient condition of Theorem 1 also belongs to this class. With this metric and input distribution, under different channel assumptions and error probability requirements, one can obtain various feasible rates, i.e. the maximum rate at which the system can operate with the required error probability under the channel model. Now, we would like to avoid specifying the channel model and error probability, and say something about the rate possible with this metric for given input and output.

Given that the word x was transmitted and y was received, a decoding error would happen if any of the other words has a metric value higher than the metric of the transmitted word. The probability of any word having a metric value exceeding that of the transmitted word is

$$p(\mathbf{x},\mathbf{y}) \triangleq \Pr\{u(\mathbf{X},\mathbf{y}) > u(\mathbf{x},\mathbf{y})\} \tag{100}$$

The probability of any of the $M - 1$ competing words exceeding the correct word is $1 - (1 - p(\mathbf{x},\mathbf{y}))^{M-1}$, and since this is a sufficient condition for an error we have that the conditional error probability is:

$$P_{e|xy}(\mathbf{x},\mathbf{y}) \ge 1 - (1 - p(\mathbf{x},\mathbf{y}))^{M-1} \tag{101}$$

Note that this bound is tight up to the question of how ties are broken: it is an inequality only since we do not know if errors occur when $u(\mathbf{X},\mathbf{y}) = u(\mathbf{x},\mathbf{y})$. If we had defined $p(\mathbf{x},\mathbf{y})$ by the event $u(\mathbf{X},\mathbf{y}) \ge u(\mathbf{x},\mathbf{y})$ then we would have an inequality in the other direction. In the following, we sometimes omit the arguments x, y and use $p$, $P_{e|xy}$ instead of $p(\mathbf{x},\mathbf{y})$, $P_{e|xy}(\mathbf{x},\mathbf{y})$ (etc.).

We may now ask the following question: given a specific pair x, y, how many codewords could one allow, while reaching a small probability of error? Using (101) and requiring $P_{e|xy} \le \epsilon$ we have:

$$1 - (1-p)^{M-1} \le \epsilon \tag{102}$$

$$M \le \frac{\log(1-\epsilon)}{\log(1-p)} + 1 \tag{103}$$

Note that both $\log(1-\epsilon)$ and $\log(1-p)$ are negative. Assuming $\epsilon < \frac{1}{2}$, $-\log(1-\epsilon) < \log(2) = 1$, and in order for M to be large ($M \gg 1$), $-\log(1-p)$ is required to be small ($\ll 1$) and therefore p needs to be close to 0. Therefore we may approximate $-\log(1-p) \approx p$. If one also assumes $\epsilon \approx 0$, the bound above can be written as $M \lesssim \frac{\epsilon}{p} + 1$. Interestingly, if we had used the union bound to calculate the error probability, we would need to require $p \cdot (M-1) \le \epsilon$, which would also mean $M \le \frac{\epsilon}{p} + 1$. Here we can see in a simple way why, for the purpose of determining the rate when the error probability is small and fixed, the union bound is tight.
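
The closeness of the exact bound (103) and the union-bound approximation is easy to see numerically. The sketch below is illustrative only; the function names and the sample values of p and the error probability are assumptions.

```python
import math


def max_codewords(p, eps):
    """Right-hand side of (103): log(1 - eps) / log(1 - p) + 1 (real valued)."""
    # log1p keeps precision when p and eps are tiny; the ratio of the two
    # logarithms does not depend on the logarithm base.
    return math.log1p(-eps) / math.log1p(-p) + 1.0


def union_bound_codewords(p, eps):
    """Largest M allowed by the union bound requirement p * (M - 1) <= eps."""
    return eps / p + 1.0


p, eps = 1e-6, 1e-3
exact = max_codewords(p, eps)
approx = union_bound_codewords(p, eps)
print(exact, approx, exact / approx)
```

For small p and a small error probability the two values differ only by a factor close to one, so the resulting rates $\frac{1}{n}\log M$ differ by a vanishing amount.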

Given $p$, $\epsilon$, it is possible to define the rate function $\frac{1}{n}\log M$ where M satisfies (103) with equality, i.e.

$$R_{emp} = \frac{1}{n}\log\left(\frac{\log(1-\epsilon)}{\log(1-p(\mathbf{x},\mathbf{y}))} + 1\right) \approx \frac{1}{n}\log\left(\frac{1}{p(\mathbf{x},\mathbf{y})}\right). \tag{104}$$

This yields a way of converting decoding metrics to rate functions. It is interesting to observe that $p(\mathbf{X},\mathbf{y})$ is uniformly U[0, 1] distributed (exactly so when the distribution of U defined below has no atoms). This is because for each y it equals the complementary CDF $1 - F_U(u)$ of the random variable U, defined as $U = u(\mathbf{X},\mathbf{y})$ (where $\mathbf{X} \sim Q$). Hence $F_U(U)$ is uniform U[0, 1]. Also, per y, $p(\mathbf{x},\mathbf{y})$ is decreasing in $u(\mathbf{x},\mathbf{y})$, and therefore decoding with the metric $\frac{1}{p(\mathbf{x},\mathbf{y})}$ is equivalent to decoding with $u(\mathbf{x},\mathbf{y})$. Similarly $R_{emp}$ can be used as a metric. It is interesting to note in this respect that if one wants to supersede the performance of K systems with metrics $u_k(\mathbf{x},\mathbf{y})$, $k = 1, \ldots, K$, by taking the maximum over their respective $R_{emp}$ (104), the resulting metric is equivalently the minimum over $p_k(\mathbf{x},\mathbf{y})$. This yields the "Merged decoder" of Feder-Lapidoth [?] (p reflects the order over $\mathcal{X}$ defined there), and it is easy to see by the union bound (still, conditioning on x, y) that indeed the error probability is at most the sum of individual error probabilities. p can be considered a canonization of u: while u is of general form, $1/p$ is an equivalent metric, constrained to a specific distribution.
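
For very short blocks the tail probability p(x, y) can be computed exactly by enumerating all candidate inputs, which makes the "canonization" concrete. The following Python sketch is a toy illustration: the two metrics (`agreements`, `disagreements`), the uniform binary prior and the helper names are assumptions chosen for the example, and the merged value is simply the minimum over the per-metric tail probabilities.

```python
import math
from itertools import product


def tail_probability(u, x, y, q, alphabet):
    """p(x, y) = Pr{ u(X, y) > u(x, y) } for X drawn i.i.d. from q."""
    ref = u(x, y)
    return sum(math.prod(q[a] for a in cand)
               for cand in product(alphabet, repeat=len(x))
               if u(cand, y) > ref)


def agreements(x, y):
    """Toy metric: number of positions where x and y agree."""
    return sum(a == b for a, b in zip(x, y))


def disagreements(x, y):
    """Toy metric: number of positions where x and y differ."""
    return sum(a != b for a, b in zip(x, y))


def merged_p(metrics, x, y, q, alphabet):
    """Minimum tail probability over a set of metrics (merged ordering)."""
    return min(tail_probability(u, x, y, q, alphabet) for u in metrics)


q = {0: 0.5, 1: 0.5}
alphabet = (0, 1)
y_obs = (0, 1, 1)
for x in product(alphabet, repeat=3):
    print(x, agreements(x, y_obs), tail_probability(agreements, x, y_obs, q, alphabet))
```

A larger metric value always gives a smaller (or equal) tail probability, so ranking codewords by small p reproduces the ranking by large u. Note that p can be exactly 0 for the best-scoring word, in which case 1/p is unbounded.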

While (102) gives us a relation between the probability $p(\mathbf{x},\mathbf{y})$ and the rate $R(\mathbf{x},\mathbf{y}) = \frac{1}{n}\log M$, it requires a specification of $\epsilon$, the error probability. Most communication systems are not designed to yield the same guaranteed error probability for each x, y, but rather on average (actually, it is impossible to have fixed M and obtain a given error probability uniformly). Therefore, instead of considering the rate with a given error probability, we mix the two together and consider the "goodput", i.e. the average number of error-free bits per channel use, which is defined as $(1 - P_e) \cdot R$ (see Section IV-E). Given x, y the question is, what is the maximum goodput that can be achieved.
Define

$$R^*_{good}(\mathbf{x},\mathbf{y}) = \sup_{M=2,3,\ldots}(1 - P_{e|xy}(\mathbf{x},\mathbf{y})) \cdot R \tag{105}$$

where $R = \frac{1}{n}\log M$. I.e. for given Q and $u(\mathbf{x},\mathbf{y})$, $R^*_{good}(\mathbf{x},\mathbf{y})$ is the maximum error-free rate conditioned on x, y which can be obtained with any number of codewords, considering the tradeoff between rate and error probability. Although $R^*_{good}(\mathbf{x},\mathbf{y})$ is a function of x, y, the goodput of any fixed rate system (where M is a constant) conditioned on seeing a specific pair x, y is also bounded by $R^*_{good}(\mathbf{x},\mathbf{y})$ (due to the supremum with respect to M above). Note that the original system possibly had a certain fixed rate, which we now ignore, since we look at all possible systems using the same metric.

We now compute an upper bound on $R^*_{good}(\mathbf{x},\mathbf{y})$ by using the bound of (101) and by relaxing the maximization over M to $[2,\infty)$ (not necessarily integer).

$$R^*_{good}(\mathbf{x},\mathbf{y}) \le \sup_{M\in[2,\infty)}(1-p)^{M-1}\cdot\frac{1}{n}\log M. \tag{106}$$

Writing M as $M = \frac{a}{-\ln(1-p)} + 1$, then $\ln[(1-p)^{M-1}] = (M-1)\ln(1-p) = -a$, and therefore

$$(1-p)^{M-1}\log M = e^{-a}\log\left(\frac{a}{-\ln(1-p)} + 1\right) \le e^{-a}\log\left(\frac{2a}{-\ln(1-p)}\right) \le \log\frac{1}{-\ln(1-p)} + e^{-a}\log a + \log(2). \tag{107}$$

It is easy to bound $f(a) = e^{-a}\ln a \le e^{-2}$, by writing $f(a) \le e^{-a}(a-1)$, and showing that the maximum of this bound is obtained for $a = 2$. Defining $c = \log(2) + e^{-2}$, (107) yields:

$$(1-p)^{M-1}\cdot\frac{1}{n}\log M \le \frac{1}{n}\log\left(\frac{1}{-\ln(1-p)}\right) + \frac{c}{n}, \tag{108}$$

and therefore

$$R^*_{good}(\mathbf{x},\mathbf{y}) \le \frac{1}{n}\log\left(\frac{1}{-\ln(1-p)}\right) + \frac{c}{n}, \tag{109}$$

hence, also the goodput function is bounded asymptotically like $\frac{1}{n}\log\frac{1}{p}$, and not far from $R_{emp}$. This is not surprising, since for any $\epsilon$, when trying to exceed the rate given by $R_{emp}$, the error probability quickly increases to close to 1 and the rate drops. Therefore $R_{emp}$ cannot be exceeded significantly, even when the error probability constraint is removed. This implies that $R^*_{good}(\mathbf{x},\mathbf{y})$ is asymptotically achievable as a rate function, which corresponds to what was shown in Section IV-E (there it is shown for general systems, not necessarily with i.i.d. random-coding, but without the maximization on M).
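
The bound on the relaxed goodput objective can be sanity-checked with a simple grid search over M. The sketch below is a hedged illustration, not part of the derivation: it works in natural logarithms throughout (so the constant is taken as ln 2 + e^{-2}), drops the common factor 1/n from both sides, and uses hypothetical function names and sample values of p.

```python
import math

# Constant c of the bound, taken here with natural logarithms (an assumption).
C_CONST = math.log(2.0) + math.exp(-2.0)


def goodput_objective(p, m):
    """(1 - p)^(M - 1) * ln(M): n times the relaxed objective in (106)."""
    return (1.0 - p) ** (m - 1.0) * math.log(m)


def goodput_upper_bound(p):
    """ln(1 / (-ln(1 - p))) + c: n times the right-hand side of (109)."""
    return -math.log(-math.log1p(-p)) + C_CONST


def grid_supremum(p, step=1.01):
    """Approximate the supremum over real M >= 2 on a geometric grid."""
    best, m = 0.0, 2.0
    limit = 50.0 / p  # beyond this point the objective is negligible
    while m <= limit:
        best = max(best, goodput_objective(p, m))
        m *= step
    return best


for p in (0.3, 1e-2, 1e-4):
    print(p, grid_supremum(p), goodput_upper_bound(p))
```

For the values tried, the grid maximum stays below the bound, and both grow like ln(1/p) as p decreases.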

The discussion above shows how to construct achievable rate functions from decoding metrics. Furthermore, if one has a set of reference decoders, with possible decoding metrics, by choosing the maximum resulting rate function, it is possible to guarantee a better rate than all the reference decoders. In the non-adaptive case, the meaning of guaranteeing a better rate is that if the universal system operates with rate R, then for any x, y for which any of the reference systems yields a rate (or goodput) larger than R, the universal system will succeed, with high probability, to decode.

It is interesting in this respect to consider, for a specific i.i.d. input distribution Q, the family of decoders using memoryless additive metrics, i.e. $u(\mathbf{x},\mathbf{y}) = \sum_{i=1}^{n} u(x_i, y_i)$. Clearly, the rate (104) or goodput (105) functions attainable by this family of decoders are independent of the order of the letters in $(x_i, y_i)$, i.e. they depend only on the empirical distribution of (x, y), and are therefore asymptotically limited by the rate function (97) $\hat I(\mathbf{x};\mathbf{y}) + D(\hat P_x\|Q)$ defined in Lemma 5. Hence, if one considers the maximum rate over all decoders in this family, this rate would still be lower than the rate function of Lemma 5. On the other hand, as shall be seen, the rate function of Lemma 5 is asymptotically adaptively achievable. As a result, there exists an adaptive rate system that, for any x, y, attains a rate at least as large as the rate that could be attained with the best memoryless decoding metric. This universality would also hold true when y is determined by a probabilistic channel and the rate is taken on average over all pairs.
VII. Rate adaptivity

In Section V-D we have shown that, asymptotically, all attainable rate functions are limited by the form

$$R_{emp}(\mathbf{x},\mathbf{y}) = \frac{1}{n}\log\frac{P(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})} \tag{110}$$

In this section we will present a rate-adaptive scheme that attains this rate function adaptively for many conditional distributions $P(\mathbf{x}|\mathbf{y})$, but not for all. Generally, the requirement is that $P(\mathbf{x}|\mathbf{y})$ could be computed sequentially, while y is gradually revealed to the decoder. Unlike in the non-adaptive case, we do not have an asymptotic characterization of all achievable rate functions. Furthermore, we do not have tight bounds on the redundancy required to achieve these rate functions. However, many rate functions of interest can be posed in a sequential form, and therefore, as we shall see, there are many examples of interesting rate functions which are adaptively achievable.

Before presenting the scheme, we would like to begin with a more fundamental question: why is feedback needed to yield 
rate adaptivity? 

A. A rate adaptive scheme 

The scheme proposed in order to achieve rate adaptivity (see Definitions 5, 6) is based on an iterative application of rateless coding, and is similar in concept to the one used in the previous paper [1]. The idea of iterative rateless coding was first proposed by Eswaran et al. [4]. We fix a number K of bits per block. At each block, the encoder transmits symbols from the codeword selected based on the message bits. The decoder examines the channel output, and decides when it has "enough information" to decode, according to a termination condition. When this condition is satisfied, the decoder sends an indication through the feedback link, and a new block begins. In the new block, additional K bits from the message string will be sent. The process ends at time n, and the last block is possibly not decoded. Thus, the rate varies by changing the number of blocks transmitted. Roughly speaking, as the rate function increases, the blocks become shorter, and the number of blocks increases.

We assume the feedback is completely reliable, but may have a limited rate and a delay. In order to model the effect of limiting the feedback rate, we define that a feedback of one bit is possible only once per $d_{FB} \ge 1$ symbols and has a delay of $d_{FB}$ symbols, i.e. the decoder may send a feedback bit only on symbol $i \cdot d_{FB} + 1$ ($i = 1, 2, \ldots$), and this bit will be seen by the encoder $d_{FB}$ symbols later at time $(i+1) \cdot d_{FB} + 1$.

Let $Q(\mathbf{x})$ denote the input prior. Suppose a block ended at symbol j, then the codebook of $\exp(K)$ codewords for the new block starting after this symbol is generated by random i.i.d. selection of each codeword, according to the distribution $Q(\mathbf{x}_{j+1}^{n}|\mathbf{x}^{j}) = \frac{Q(\mathbf{x}^n)}{Q(\mathbf{x}^j)}$, where $\mathbf{x}^j$ are the symbols that had already been transmitted. This guarantees that irrespective of the message, the input distribution remains Q. The randomization is carried out by using the common randomness. Under the assumption that there are no decoding errors, the decoder knows $\mathbf{x}^j$, and thus the codebook is known at both sides of the link. If there are decoding errors, there may be unexpected behavior at the decoder side; however, the input distribution is maintained as $Q(\mathbf{x})$, as required. For simplicity, we always treat the codewords as vectors of length n, where all the prefixes of the codewords will be fixed and equal $\mathbf{x}^j$. The codeword that encodes message m ($m = 1, 2, \ldots, \exp(K)$) is denoted $\mathbf{x}^{(m)}$. At each block m is formed from new K bits out of the input message sequence, and the encoder sends the symbols of $\mathbf{x}^{(m)}$ matching the time index, one by one.

The decoding is carried out by using a decoding metric $\psi(\mathbf{x}^k,\mathbf{y}^k,j)$ and a decoding threshold $\psi^*_{j,k}$, which are defined for all $0 \le j < k \le n$. $\psi(\mathbf{x}^k,\mathbf{y}^k,j)$ is interpreted as the decoding metric at time k where the last block ended at time j. To prove the properties of this scheme that are given in Theorem 7, some assumptions on $\psi(\mathbf{x}^k,\mathbf{y}^k,j)$ are required. Potentially, these assumptions are satisfied only when $k - j$ is large enough, $k - j \ge b_0$, and in this case $\psi^*$ will be defined as infinity for the first $b_0$ symbols in each block.

The decoder decides to decode the current block at time k if

1) $k - 1$ divides by $d_{FB}$ (i.e. there is a chance to send a feedback bit)

2) There exists a codeword $m \in \{1, 2, \ldots, \exp(K)\}$ such that

$$\psi((\mathbf{x}^{(m)})^k,\mathbf{y}^k,j) \ge \psi^*_{j,k} \tag{111}$$

Note that these $(\mathbf{x}^{(m)})^k$ include a common history of length j and an unknown part of length $k - j$.
If the decoder decided to terminate at symbol k, then the encoder will start a new block at symbol $k + d_{FB}$. Thus for the new block we will have $j' = k + d_{FB} - 1$ (the last symbol of the previous block). New blocks always start on symbols $i \cdot d_{FB} + 1$ (the first, at symbol 1).
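
The timing rules above (feedback opportunities, the minimum block length and the feedback delay) can be summarized in a short scheduling sketch. This is only an illustration of the bookkeeping: the function name `block_boundaries` and the predicate `exceeds(j, k)`, which stands in for "some codeword's metric reaches the threshold at time k", are hypothetical, and no actual codebook or metric is simulated.

```python
def block_boundaries(n, d_fb, b0, exceeds):
    """Return the list of (j, k) pairs of decoded blocks.

    j is the last symbol of the previous block (0 for the first block) and k
    is the symbol at which the decoder decides to decode. `exceeds(j, k)` is a
    caller-supplied stand-in for the threshold test (111).
    """
    blocks = []
    j = 0
    k = j + 1
    while k <= n:
        feedback_slot = (k - 1) % d_fb == 0   # condition 1: k - 1 divides by d_FB
        long_enough = (k - j) >= b0           # threshold is infinite for short blocks
        if feedback_slot and long_enough and exceeds(j, k):
            blocks.append((j, k))
            j = k + d_fb - 1                  # symbols sent during the feedback delay
            k = j + 1                         # new block starts at symbol k + d_FB
        else:
            k += 1
    return blocks


# Toy termination rule: the threshold is reached once 5 symbols were observed.
decoded = block_boundaries(n=20, d_fb=2, b0=0, exceeds=lambda j, k: k - j >= 5)
print(decoded)
K_BITS = 8  # hypothetical number of bits per block
print(K_BITS * len(decoded) / 20)
```

With a feedback period of 2, each decoded block costs one extra symbol of delay, and the symbols after the last decoded block carry an undecoded block, in line with the description of the scheme.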

The scheme is defined with respect to the parameters $K$, $\psi$, $\psi^*$, $b_0$, $d_{FB}$, and its performance will be a function of these factors. The scheme is illustrated in Figure 6, where in the top of the figure, the division of the n channel uses into blocks is depicted. The blocks have an arbitrary length. At the bottom of the figure, the process of decoding the third block is detailed, showing the transmitted word, and the feedback delay at the end of the block.
Fig. 6. An illustration of the rate adaptive scheme. Top: the n channel uses divided into Blocks 1 to 6. Bottom: detail of Block 3, showing the previous blocks $\mathbf{x}^j$ (fixed), the index j (for block 3), the new symbols $\mathbf{x}_{j+1}^{k} \sim Q(\cdot|\mathbf{x}^j)$, the point where the decoder decides to decode, and the point where the encoder receives the indication and starts a new block (Block 4).



B. The performance of the rate adaptive scheme 

The following theorem formalizes a claim on the performance of the scheme presented above, under some assumptions on the parameters. The theorem gives the achieved rate as a function of the decoding metric, and shows that asymptotically $R_{emp} \approx \frac{1}{n}\log\psi(\mathbf{x},\mathbf{y},0)$ is achievable. This relation, as well as the conditions we define on $\psi$, may appear a little cryptic at this point. This is mainly since, in order to keep the generality of the theorem, which will make it useful in several cases later
on, we avoid specifying $\psi$. To better understand the theorem, it is useful at this point to think of the following substitution of $\psi$:

$$\psi(\mathbf{x}^k,\mathbf{y}^k,j) = \frac{P(\mathbf{x}_{j+1}^{k}|\mathbf{x}^j,\mathbf{y}^k)}{Q(\mathbf{x}_{j+1}^{k}|\mathbf{x}^j)}$$

for some conditional probability law P. In this case, it is easy to see that the rate function defined above aligns with the generic conditional form of rate functions (40), and the other conditions on $\psi$ will make sense.

Theorem 7. For the channel $\mathcal{X} \to \mathcal{Y}$, a given block length n, prior $Q(\mathbf{x})$ and error probability $\epsilon$, and with respect to the scheme of Section VII-A operating with K bits/block, a decoding metric $\psi$, decoding thresholds $\psi^*$ and feedback delay $d_{FB}$, which satisfy the following conditions:

1) CCDF condition: The following bound holds for all $k - j \ge b_0 \ge 0$ (for $b_0 \in \mathbb{Z}_+$) and all $\mathbf{y}^k$:

$$\Pr\{\psi(\mathbf{X}^k,\mathbf{y}^k,j) \ge t \,|\, \mathbf{x}^j\} \le \frac{L_{k-j}}{t} \tag{112}$$

for some sequence $L_i > 0$. Alternatively, the following sufficient condition (due to Markov inequality) can be met:

$$\mathbb{E}[\psi(\mathbf{X}^k,\mathbf{y}^k,j) \,|\, \mathbf{x}^j] \le L_{k-j} \tag{113}$$

2) Approximate summability (convexity): Let $\{(j_b, k_b)\}_{b=1}^{B}$ be a set of B pairs of increasing indices indicating segments in time $j_1 < k_1 \le j_2 < k_2, \ldots, j_b < k_b \le j_{b+1}, \ldots, j_B < k_B \le n$, where $(j_b, k_b)$ refers to symbols $j_b + 1, \ldots, k_b$. Define $\psi_0 = \psi(\mathbf{x}^n,\mathbf{y}^n,0)$ and $\psi_b = \psi(\mathbf{x}^{k_b},\mathbf{y}^{k_b},j_b)$. I.e. one is the metric measured on the entire transmission, and the other is the metric for a specific segment. Let $m_0$ denote the number of symbols that are not included in any segment, $m_0 = n - \sum_{b=1}^{B}(k_b - j_b)$. Then there exists a function $f_0^{(n)} : \mathbb{R} \to \mathbb{R}_+$ such that the following is satisfied:

$$\log\psi_0 - \sum_{b=1}^{B}\log\psi_b \le f_0^{(n)}(\psi_0)\cdot m_0 \tag{114}$$

I.e. the difference between the log-metric on the entire transmission and on the segments can be bounded as a function of the number of symbols not participating in the sum.

3) Technical assumptions:

* $L_i$ is non-decreasing in i (for $i = 1, 2, \ldots$)
Define

$$R_{emp} = \frac{1}{n}\log\psi(\mathbf{x}^n,\mathbf{y}^n,0) = \frac{1}{n}\log\psi_0 \tag{115}$$

and

$$F_n(t) = \ldots \tag{116}$$

with $c_n = \log\ldots$ and $b_1 = \ldots$, where (116) is an expression in $c_n + b_1 \cdot f_0^{(n)}(\exp(nt))$ and $K$. Then $F_n(R_{emp})$ is adaptively achievable by the scheme of Section VII-A using the threshold

$$\psi^*_{j,k} = \frac{n \cdot L_{k-j}\cdot\exp(K)}{d_{FB}\,\epsilon} \tag{117}$$

Corollary 7.1. If $\frac{1}{n}\log L_n \ldots$ and for some sequence $\delta_n \in [0,1]$, $\delta_n \to 0$, $\forall t: \ldots$ (this holds trivially if $f_0^{(n)}$ is upper bounded by a constant), then $R_{emp}$ is asymptotically adaptively achievable.

Corollary 7.2. If the rate function $R_{emp}$ defined in (115) is bounded, $R_{emp} \le R_{max}$, then it is achievable up to $\delta_n = \ldots$, where $f_0^* = \max_{t \le R_{max}} f_0^{(n)}(\exp(nt))$. In other words, in this case we can bound the additive loss and have $F_n(t)$ of the form $t - \delta_n$. Furthermore, for small $\epsilon$ and large n, if $f_0^*$ is upper bounded for all n, this rate function is achievable up to $\approx 2\sqrt{\ldots}$.

Corollary 7.3. Under the conditions of Theorem 7, the above rate function (115) is also non-adaptively achievable, with an intrinsic redundancy of $\mu_Q(R_{emp}) \le \frac{1}{n}\log L_n$.

Note that the Theorem refers to decoding metrics satisfying specific conditions. In some cases one can modify a given decoding metric by adding constants that will enable satisfying these conditions (see for example Section VIII-E1). A note regarding Corollary 7.2: in the non-adaptive case the redundancy was of the order of $O\left(\frac{1}{n}\right)$, whereas here it is larger by more than a square root, $O\left(\sqrt{\frac{\log n}{n}}\right)$. This relatively large redundancy is due to the fact we have divided the transmission into blocks and there are approximately $O(\sqrt{n})$ blocks.
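
As a sanity check on the Markov-type sufficient condition (113), one can verify by enumeration that a likelihood-ratio metric has unit expectation under the prior, so that the tail bound (112) holds with $L = 1$. The sketch below is an illustration under stated assumptions: it takes $j = 0$ and $\psi = P(\mathbf{x}|\mathbf{y})/Q(\mathbf{x})$ with memoryless P and Q (the form of (110)); the transition table `w`, the prior `q`, the output sequence and the function names are hypothetical.

```python
import math
from itertools import product


def likelihood_ratio_metric(x, y, w, q):
    """psi = P(x|y) / Q(x) with P(x|y) = prod w[y_i][x_i] and Q(x) = prod q[x_i]."""
    return math.prod(w[b][a] / q[a] for a, b in zip(x, y))


def expectation_and_tail(y, w, q, alphabet, t):
    """Return (E_Q[psi(X, y)], Pr_Q{psi(X, y) >= t}) by full enumeration."""
    mean, tail = 0.0, 0.0
    for x in product(alphabet, repeat=len(y)):
        prob = math.prod(q[a] for a in x)
        psi = likelihood_ratio_metric(x, y, w, q)
        mean += prob * psi
        if psi >= t:
            tail += prob
    return mean, tail


q = {0: 0.5, 1: 0.5}                                  # hypothetical prior
w = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}        # hypothetical w[y][x]
y_obs = (0, 1, 1, 0, 1)
for t in (1.0, 2.0, 5.0, 10.0):
    mean, tail = expectation_and_tail(y_obs, w, q, (0, 1), t)
    print(t, mean, tail, 1.0 / t)
```

The expectation equals 1 for any output sequence because the ratio averages out to the total mass of $P(\cdot|\mathbf{y})$, and the tail probabilities stay below 1/t as Markov's inequality requires.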



C. An intuitive explanation 

D. Proof of Theorem 7

For brevity we denote $d = d_{FB}$. We begin by determining the decoding thresholds that allow us to bound the error probability by $\epsilon$. We require that for any symbol in which a decision is made, $i \cdot d$, $1 \le i \le n/d$, the probability of deciding in favor of a different codeword than the one that is transmitted is at most $\frac{\epsilon d}{n}$, conditioned on the input sequence, and on the assumption there were no errors up to this point. Since there are no more than $n/d$ such events, by the union bound this would guarantee that the probability of any of these events, conditioned on the input sequence, is at most $\epsilon$. When any of these events happens, there is an error, and we do not give any guarantee on the decoding rate. When none of these events happen, the message is perfectly decoded, and we will be able to give a deterministic lower bound on the rate. The probabilities are conditioned on the input sequence since Definition 6 requires an error probability guarantee for any input and output sequence (the output sequence is treated as a deterministic sequence).

We consider a decoding at time k where the previous block ended at time j. The true codeword is denoted m and the channel input is therefore $\mathbf{X}^k = (\mathbf{X}^{(m)})^k$. The alternative codeword is denoted $\tilde m$. By our definition, the two codewords are equal up to time j (common history) and independent from time $j + 1$ on. Therefore, in terms of the probability of the decoding metric to exceed the threshold for codeword $\tilde m$, knowing the channel input X is equivalent to knowing the first j elements of $\mathbf{X}^{(\tilde m)}$. In other words, given X, $\mathbf{X}^{(\tilde m)}$ equals $\mathbf{X}^j$ up to time j and is distributed $Q(\cdot|\mathbf{X}^j)$ from that time on. By our assumption that there are no decoding errors so far, the codebook used by the decoder is correct.

If fc — j < bo then there is no guarantee on the distribution of ip, and therefore we set ip* = oo, i.e. do not decode regardless 
of the channel output. Assuming k — j > bo, the probability of any codeword to exceed the threshold is: 



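$$\Pr\left\{\psi\left((\mathbf{X}^{(\tilde{m})})^k, \mathbf{y}^k, j\right) \ge \psi^*_{j,k} \,\middle|\, \mathbf{X}\right\} = \Pr\left\{\psi(\mathbf{X}^k, \mathbf{y}^k, j) \ge \psi^*_{j,k} \,\middle|\, \mathbf{X}^j\right\} \le \frac{L_{k-j}}{\psi^*_{j,k}} \tag{118}$$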



Footnote on the union bound over decisions: the elements in the union are the following events: an error in the first decision, an error in the second decision given that the first is correct, etc. The union of these events is the event of any error occurring.

Since there are $\exp(K) - 1$ competing codewords, using the union bound, the probability that any codeword will exceed the threshold is upper bounded by

$$\Pr\left\{\exists \tilde{m} : \psi\left((\mathbf{X}^{(\tilde{m})})^k, \mathbf{y}^k, j\right) \ge \psi^*_{j,k} \,\middle|\, \mathbf{X}\right\} \le \exp(K) \cdot \frac{L_{k-j}}{\psi^*_{j,k}} \le \frac{d\epsilon}{n} \tag{119}$$

Setting the threshold to:

$$\psi^*_{j,k} = \frac{n \cdot L_{k-j} \cdot \exp(K)}{d\epsilon} \tag{120}$$

would guarantee meeting the error probability requirement. Note that tighter bounds can be obtained for specific structures of the metric, specifically when the metric is a product of single-letter metrics and $Q$ is i.i.d., by using the methods proposed by Feder & Blits, and for these cases the factor $n$ in (120) could be avoided. Here we used the union bound on symbols, which is simpler and more general, but less tight.

We now turn to analyze the rate. When $\mathbf{X}$ and $\mathbf{y}$ are given, and under the assumption that no decoding errors occurred, the decoding times are deterministic, and result in a deterministic rate. We denote by $B$ the number of blocks, including the last one which is potentially not decoded. The actual rate of the scheme satisfies

$$R_{act} \ge \frac{(B-1)K}{n} \tag{121}$$

We now use the summability condition to relate $R_{act}$ and $R_{emp}$. Define by $j_b$ ($b = 1, \ldots, B$) the end-time of the previous block for any of the blocks. A typical block, which is long enough, has the following time line: during the first $b_0$ symbols the decoding condition is not checked. The opportunities to send feedback are symbols $i \cdot d + 1$ in the block ($i = 0, 1, \ldots$). The decoding condition is checked for the first time in symbol $\lceil b_0/d \rceil \cdot d + 1$ (the minimal $i \cdot d + 1$ satisfying $i \cdot d + 1 \ge b_0 + 1$). The condition may be met on this symbol, in which case the new block would begin $d$ symbols later. The block may even be terminated before this time, if time $n$ arrives. However in the typical case, as depicted in the bottom of Figure 7, the condition is not met on this symbol, and then it is checked again every $d$ symbols, until it is finally met. For a block $b$, long enough, define $k_b$ as the last time in which the decoding condition was checked (after location $b_0$) and did not pass, i.e. the metric of none of the codewords, including the correct one, passed the threshold. Suppose that such a $k_b$ exists, then the decoding condition was met at time $k_b + d$, and a new block was started at time $k_b + 2d$. Therefore the length of the block is $l_b = k_b + 2d - j_b - 1$. For a given block length $l_b$, the condition for the existence of $k_b$ is that the symbol number of this opportunity satisfies $k_b - j_b \ge b_0 + 1$, i.e. $l_b \ge b_0 + 2d$. When this happens, the fact that the decoding condition failed at time $k_b$ yields an upper bound on the decoding metric, since we know that for the true codeword, we have:

$$\psi(\mathbf{X}^{k_b}, \mathbf{y}^{k_b}, j_b) < \psi^*_{j_b,k_b} \tag{122}$$

Note that this yields a bound on the value of $\psi$ up to $2d - 1$ symbols before the end of the block: after time $k_b$ there are $2d - 1$ additional symbols which are not "covered" by this bound. For the shorter blocks, we do not have any bound on $\psi$.

We divide the $B$ blocks into a group $B_L$ of blocks whose length is at least $b_0 + 2d$ and a group $B_S$ of blocks whose length is smaller. The last block may be included in one group or the other. For the blocks in the first group, we define $j_b$ and $k_b$ as above, and we have the bound of (122). The pairs $(j_b, k_b)$ are interpreted as the "segments" referred to in the summability condition. We effectively split the $n$ symbols into "constrained" symbols (contained in the segments $(j_b, k_b)$), for which we have a bound on $\psi$, and "unconstrained" symbols for which we do not have a bound. The summability condition allows us to relate the overall metric $\psi_0^n$ to the values of the metric on the segments, and the number $m_0$ of "unconstrained" symbols. We now count the number of "unconstrained" symbols, i.e. those that are not covered by any segment. In each long block there are $2d - 1$ unconstrained symbols, unless it is the last one, in which case there are at most $d$. All the symbols of a short block, which are at most $b_0 + 2d - 1$, are unconstrained. Therefore the total number of unconstrained symbols is at most $m_0 = (2d-1) \cdot |B_L| + (b_0 + 2d - 1) \cdot |B_S|$. Substituting $|B_S| = B - |B_L|$ we may write $m_0$ as $m_0 = (b_0 + 2d - 1) \cdot B - b_0 \cdot |B_L|$.
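The counting argument above can be checked numerically. The following is a minimal sketch, assuming block lengths are given as a plain list; the function name and the example values are hypothetical and are not taken from the paper. It computes the bound on the number of unconstrained symbols both directly and through the rewritten form, and the two should agree.

```python
def unconstrained_symbols(block_lengths, b0, d):
    """Bound m0 on the number of unconstrained symbols.

    A block is "long" when its length is at least b0 + 2d; each long block
    contributes 2d - 1 unconstrained symbols, and each short block is counted
    at its maximal length b0 + 2d - 1.
    Returns the bound computed directly and via the rewritten form.
    """
    threshold = b0 + 2 * d
    num_long = sum(1 for length in block_lengths if length >= threshold)
    num_short = len(block_lengths) - num_long
    direct = (2 * d - 1) * num_long + (b0 + 2 * d - 1) * num_short
    # Substitute |B_S| = B - |B_L|
    total_blocks = len(block_lengths)
    rewritten = (b0 + 2 * d - 1) * total_blocks - b0 * num_long
    return direct, rewritten


# Hypothetical example: 6 blocks, b0 = 5, d = 2 (long means length >= 9)
direct, rewritten = unconstrained_symbols([3, 20, 15, 4, 30, 12], b0=5, d=2)
print(direct, rewritten)
```

The bound decreases by $b_0$ for every long block, which is why the later analysis only needs to consider the extreme cases of all blocks short or all blocks long.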

Figure 7 illustrates the constrained and unconstrained symbols. The top of the figure shows the overall transmission time $1, \ldots, n$ divided into 6 blocks. Blocks 1 and 4 are short, and the rest are long. The dark parts denote the segments $(j_b, k_b)$ for which the constraint (122) applies. The white parts denote unconstrained symbols, which occur on short blocks and at the last symbols of long blocks. The bottom of the figure illustrates the time line of a long block, as was already discussed above.

Applying the summability condition (114) we have:

$$\log \psi_0^n - \sum_{b \in B_L} \log \psi(\mathbf{X}^{k_b}, \mathbf{y}^{k_b}, j_b) \le f_0^{(n)}(\psi_0^n) \cdot m_0 = f_0^{(n)}(\psi_0^n) \cdot \left((b_0 + 2d - 1) \cdot B - b_0 \cdot |B_L|\right) \tag{123}$$

Substituting the threshold (120) we have:

$$\sum_{b \in B_L} \log \psi(\mathbf{X}^{k_b}, \mathbf{y}^{k_b}, j_b) < \sum_{b \in B_L} \log \psi^*_{j_b,k_b} = \sum_{b \in B_L} \log \frac{n \cdot L_{k_b - j_b} \cdot \exp(K)}{d\epsilon} \le \sum_{b \in B_L} \log \frac{n \cdot L_n \cdot \exp(K)}{d\epsilon} = |B_L| \cdot \left[\log \frac{n L_n}{d\epsilon} + K\right] \tag{124}$$

[Figure 7, bottom panel: time line of Block 3, a typical long block. Labels: $j_b$; start of current block; first check point; $k_b$ = last check point in which decoding condition not met; decoding decision made; last symbol of current block; first symbol of next block; spacing $d$.]

Fig. 7. An illustration of the constrained and unconstrained symbols



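Therefore, using (123),

$$\log \psi_0^n \le |B_L| \cdot \left[\log \frac{n L_n}{d\epsilon} + K\right] + f_0^{(n)}(\psi_0^n) \cdot \left((b_0 + 2d - 1) \cdot B - b_0 \cdot |B_L|\right) \triangleq \rho(|B_L|) \tag{125}$$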



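The above expression, denoted $\rho(|B_L|)$, is a linear function of $|B_L|$. Not knowing $|B_L|$, we may upper bound this expression by its maximum value $\max_{0 \le |B_L| \le B} \rho(|B_L|)$. Due to the linearity, the maximum is always obtained at the edges $|B_L| \in \{0, B\}$, therefore

$$\rho(|B_L|) \le \max_{0 \le |B_L| \le B} \rho(|B_L|) = \max_{|B_L| \in \{0, B\}} \rho(|B_L|) = \max(\rho(0), \rho(B)) = \rho(0) + [\rho(B) - \rho(0)]^+ \tag{126}$$

where $[x]^+ = \max(x, 0)$. Substituting in (125) we have:

$$\begin{aligned}
\log \psi_0^n &\le f_0^{(n)}(\psi_0^n) \cdot (b_0 + 2d - 1) \cdot B + B \cdot \left[\log \frac{n L_n}{d\epsilon} + K - f_0^{(n)}(\psi_0^n) \cdot b_0\right]^+ \\
&\le f_0^{(n)}(\psi_0^n) \cdot (b_0 + 2d - 1) \cdot B + B \cdot \left[\log \frac{n L_n}{d\epsilon} + K\right] \\
&= \left[f_0^{(n)}(\psi_0^n) \cdot (b_0 + 2d - 1) + \log \frac{n L_n}{d\epsilon} + K\right] \cdot B
\end{aligned} \tag{127}$$

Extracting a lower bound on $B$ from (127) we have:

$$\begin{aligned}
R_{act} \ge \frac{(B-1)K}{n} &\ge \frac{K}{n} \cdot \left[\frac{\log \psi_0^n}{f_0^{(n)}(\psi_0^n) \cdot (b_0 + 2d - 1) + \log \frac{n L_n}{d\epsilon} + K} - 1\right] \\
&= \frac{\frac{1}{n} \log \psi_0^n}{1 + \frac{1}{K}\left(\log \frac{n L_n}{d\epsilon} + f_0^{(n)}(\psi_0^n) \cdot (b_0 + 2d - 1)\right)} - \frac{K}{n} \\
&= \frac{R_{emp}}{1 + \frac{1}{K}\left(\log \frac{n L_n}{d\epsilon} + f_0^{(n)}(\exp(n R_{emp})) \cdot (b_0 + 2d - 1)\right)} - \frac{K}{n}
\end{aligned} \tag{128}$$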



This proves the main claim of the theorem. Regarding the sufficient Markov-based CCDF condition (113), it is easy to see that if (113) holds then the bound (112) is obtained by applying the Markov inequality (25). □



Proof of Corollary 7.1

The proof is completely technical, by showing that under the conditions $F_n(t) \to t$. If $\frac{1}{n} \log L_n \to 0$ then there exists a sequence $\Delta_n \in [0, 1]$, $\Delta_n \to 0$ such that $\frac{1}{n \Delta_n} \log L_n \to 0$. As an example we can choose $\Delta_n = \min\left(\sqrt{\frac{1}{n} \log L_n}, 1\right)$. We choose $K = n \cdot \max\{\Delta_n, \delta_n, n^{-1/2}\}$. Then for all $t$, the term in the denominator of $F_n(t)$ (116) satisfies:

$$\frac{c_n + b_1 \cdot f_0^{(n)}(\exp(nt))}{K} = \frac{\log \frac{n}{d\epsilon} + \log L_n + b_1 \cdot f_0^{(n)}(\exp(nt))}{n \cdot \max\{\Delta_n, \delta_n, n^{-1/2}\}} \le \underbrace{\frac{\log \frac{n}{d\epsilon}}{\sqrt{n}}}_{\to 0} + \underbrace{\frac{\log L_n}{n \Delta_n}}_{\to 0} + b_1 \cdot \underbrace{\frac{f_0^{(n)}(\exp(nt))}{n \delta_n}}_{\to 0} \tag{129}$$

and in addition

$$\frac{K}{n} = \max\{\Delta_n, \delta_n, n^{-1/2}\} \to 0 \tag{130}$$

Therefore $F_n(t) \to t$, and by definition, $R_{emp}$ is asymptotically achievable. □

Note that the condition on $f_0$ is essentially that $\forall t : \frac{f_0^{(n)}(\exp(nt))}{n} \to 0$, however it was defined by using a sequence $\delta_n$ since the convergence is not necessarily uniform in $t$, and therefore it is not always possible to extract a sequence $\delta_n$ from $f_0$ itself (as we have done for the other overhead sequence $\frac{1}{n} \log L_n$).

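Proof of Corollary 7.2

Define $f_0^{(n)*} = \max_{t \le R_{max}} f_0^{(n)}(\exp(nt))$, and bound $F_n(t)$ of (116) for all $t \le R_{max}$ as

$$F_n(t) = \frac{t}{1 + \frac{c_n + b_1 \cdot f_0^{(n)}(\exp(nt))}{K}} - \frac{K}{n} \ge \frac{t}{1 + \frac{c_n + b_1 \cdot f_0^{(n)*}}{K}} - \frac{K}{n} \ge t \cdot \left(1 - \frac{k_n}{K}\right) - \frac{K}{n} \ge t - \underbrace{\left(\frac{R_{max} k_n}{K} + \frac{K}{n}\right)}_{= \delta_n} \tag{131}$$

with $k_n = c_n + b_1 \cdot f_0^{(n)*}$, where the second inequality uses $\frac{1}{1+x} \ge 1 - x$.

We choose the value of $K$ that minimizes the overhead term $\delta_n$ in the lower bound, using the following lemma:

Lemma 6. For $a > 0$, $b > 0$ with $b \le a$

$$T = \min_{k \in \mathbb{N}} \left(\frac{a}{k} + bk\right) \le 3\sqrt{ab} \tag{132}$$

Proof of the lemma: It is easy to see by differentiation that the minimizer over $x \in \mathbb{R}$ of $\frac{a}{x} + bx$ is $x^* = \sqrt{a/b}$. Choosing $k^* = \lceil x^* \rceil$ we have $k^* \in \mathbb{N}$ and since $\sqrt{a/b} \le k^* \le \sqrt{a/b} + 1$:

$$\frac{a}{k^*} + b k^* \le \frac{a}{\sqrt{a/b}} + b\left(\sqrt{\frac{a}{b}} + 1\right) = 2\sqrt{ab} + b = 2\sqrt{ab} + \sqrt{b^2} \overset{b \le a}{\le} 3\sqrt{ab} \tag{133}$$

□

Applying the lemma with $a = R_{max} k_n$, $b = \frac{1}{n}$ we obtain

$$\delta_n \le 3\sqrt{\frac{R_{max} k_n}{n}} = 3\sqrt{\frac{R_{max}}{n}\left(c_n + b_1 \cdot f_0^{(n)*}\right)} \tag{134}$$

Since $k_n$ grows with $n$, asymptotically $a \gg b$, and therefore the result in Lemma 6 falls closer to $2\sqrt{ab}$, and the factor in $\delta_n$ approaches 2. However this coarse bound was chosen for its simplicity, as it does not change the order of magnitude. For large $n$ we can use the approximation $2\sqrt{ab}$. □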
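As a numeric sanity check of Lemma 6, the sketch below evaluates $\frac{a}{k} + bk$ at the integer $k^* = \lceil\sqrt{a/b}\rceil$ used in the proof and compares it with $3\sqrt{ab}$. The function names and test values are hypothetical; this only illustrates the inequality and is not part of the scheme.

```python
import math


def tradeoff_at_rounded_minimizer(a, b):
    """Evaluate a/k + b*k at k* = ceil(sqrt(a/b)), the choice used in the proof."""
    k_star = max(1, math.ceil(math.sqrt(a / b)))
    return k_star, a / k_star + b * k_star


def lemma_bound_holds(a, b):
    """Check a/k* + b*k* <= 3*sqrt(a*b) for a > 0, b > 0, b <= a."""
    _, value = tradeoff_at_rounded_minimizer(a, b)
    return value <= 3 * math.sqrt(a * b) + 1e-12


# Hypothetical values; in the corollary a = R_max * k_n and b = 1/n.
for a, b in [(1.0, 1.0), (4.0, 1.0), (2.0, 1.0), (10.0, 0.001), (50.0, 1e-6)]:
    k_star, value = tradeoff_at_rounded_minimizer(a, b)
    print(a, b, k_star, value, 3 * math.sqrt(a * b), lemma_bound_holds(a, b))
```

When $a \gg b$ the printed value is close to $2\sqrt{ab}$, matching the remark that the factor 3 is a coarse but convenient constant.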



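Proof of Corollary 7.3: This stems directly from the CCDF condition, computed for $k = n$, $j = 0$:

$$\Pr\{\psi(\mathbf{X}, \mathbf{y}, 0) \ge t\} \le \frac{L_n}{t} \tag{135}$$

Therefore the intrinsic redundancy $\mu_Q$ is

$$\begin{aligned}
\mu_Q(R_{emp}) &= \sup_{\mathbf{y}, R \in \mathbb{R}} \left\{\frac{1}{n} \log Q\{R_{emp}(\mathbf{X}, \mathbf{y}) \ge R\} + R\right\} \\
&= \sup_{\mathbf{y}, R \in \mathbb{R}} \left\{\frac{1}{n} \log Q\{\psi(\mathbf{X}, \mathbf{y}, 0) \ge \exp(nR)\} + R\right\} \\
&\le \sup_{\mathbf{y}, R \in \mathbb{R}} \left\{\frac{1}{n} \log \frac{L_n}{\exp(nR)} + R\right\} = \frac{1}{n} \log L_n
\end{aligned} \tag{136}$$

□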
E. A conditional probability based empirical rate 



In Section V-D we have shown that, asymptotically, all maximum attainable rate functions are of the form $R_{emp}(\mathbf{x}, \mathbf{y}) = \frac{1}{n} \log \frac{P(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})}$. We now present a specific "causal" structure for $P$ and show that with this structure $R_{emp}$ can also be adaptively attained. The set of $P(\cdot|\cdot)$ we use is based on a "causality" condition.

Definition 9 (causality). A conditional probability distribution $P(\mathbf{x}|\mathbf{y})$ defined over $\mathbf{x} \in \mathcal{X}^n$, $\mathbf{y} \in \mathcal{Y}^n$ is said to be D-causal (for some non-negative $D$), if for all $k \le n$:

$$P(\mathbf{x}^k|\mathbf{y}) = P(\mathbf{x}^k|\mathbf{y}^{k+D}) \tag{137}$$

i.e. computing the conditional probability of a sub-vector only requires considering $D$ future symbols of $\mathbf{y}$.

An equivalent condition is that $P(\mathbf{x}^n|\mathbf{y})$ can be written as $P(\mathbf{x}^n|\mathbf{y}) = \prod_{i=1}^n P(x_i|\mathbf{x}^{i-1}, \mathbf{y}^{i+D})$. This is since we can always write $P(\mathbf{x}^n|\mathbf{y}) = \prod_{i=1}^n P(x_i|\mathbf{x}^{i-1}, \mathbf{y})$, and in this case $P(\mathbf{x}^k|\mathbf{y}) = \sum_{x_{k+1}^n} P(\mathbf{x}^n|\mathbf{y}) = \prod_{i=1}^k P(x_i|\mathbf{x}^{i-1}, \mathbf{y})$, and the latter should be a function of $\mathbf{y}^{k+D}$ for any $k$. Unfortunately, the causality we defined, and which is needed for the adaptive achievability, is the causality of the backward channel (from $\mathbf{y}$ to $\mathbf{x}$). Most channel models define a causal relation from $\mathbf{x}$ to $\mathbf{y}$, and if there is memory in the channel, the backward channel will not be causal, in general. To accommodate such cases we have allowed a dependence on $D$ future symbols of $\mathbf{y}$ (see example ). Softer conditions can be defined instead of the strict equality in (137), however this requirement is sufficient for our purposes.

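Given a D-causal distribution $P$, we define the following decoding metric:

$$\psi(\mathbf{x}^k, \mathbf{y}^k, j) = \frac{P(\mathbf{x}^{k-D}|\mathbf{y}^k)}{Q(\mathbf{x}^k)} \cdot \left(\frac{P(\mathbf{x}^{j-D}|\mathbf{y}^j)}{Q(\mathbf{x}^j)}\right)^{-1} = \frac{P(x_{j-D+1}^{k-D}|\mathbf{x}^{j-D}, \mathbf{y}^k)}{Q(x_{j+1}^k|\mathbf{x}^j)} \tag{138}$$

Note that for $k \le D$, $\mathbf{x}^{k-D}$ is simply the empty set and in this case we define $P(\mathbf{x}^{k-D}|\mathbf{y}^k) = 1$. The equality holds by using Bayes rule and since due to D-causality we can replace $P(\mathbf{x}^{j-D}|\mathbf{y}^j)$ by $P(\mathbf{x}^{j-D}|\mathbf{y}^k)$. Note that we can write (for $k \ge j$):

$$\psi(\mathbf{x}^k, \mathbf{y}^k, 0) = \psi(\mathbf{x}^j, \mathbf{y}^j, 0) \cdot \psi(\mathbf{x}^k, \mathbf{y}^k, j) \tag{139}$$

Note that the above is analogous to Bayes rule. We will assume that $\mathbf{x}$ is discrete and therefore $P(\mathbf{x}^k|\mathbf{y}) \le 1$. Regarding the conditional distribution of the input we make the assumption that any symbol that has non-zero probability has a probability of at least $q_{min}$, i.e. for all $k$, $Q(x_k|\mathbf{x}^{k-1}) \in \{0\} \cup [q_{min}, 1]$. Under these assumptions $\psi$ defined above satisfies the conditions of Theorem 7 and its corollaries, and we have the following result: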

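Theorem 8. Let $(\mathbf{x}, \mathbf{y}) \in \mathcal{X}^n \times \mathcal{Y}^n$ where $\mathbf{x}$ is discrete ($|\mathcal{X}| < \infty$), and let $Q(\mathbf{x})$ be an input distribution that satisfies $\forall k : Q(x_k|\mathbf{x}^{k-1}) \in \{0\} \cup [q_{min}, 1]$. Let $P(\mathbf{x}|\mathbf{y})$ be a D-causal conditional distribution. Define the following rate function:

$$R_{emp} = \frac{1}{n} \log \frac{P(\mathbf{x}^n|\mathbf{y}^n)}{Q(\mathbf{x}^n)} \tag{140}$$

Then:

1) The scheme of Section VII-A with $\psi$ defined in (138), $\psi^*_{j,k} = \frac{n \cdot |\mathcal{X}|^D \cdot \exp(K)}{d_{FB}\,\epsilon}$ and $b_0 = 0$ adaptively achieves $F_n(R_{emp})$, where $F_n(t) = \left(1 + \frac{c_n + (2d_{FB} - 1) \cdot \log q_{min}^{-1}}{K}\right)^{-1} t - \frac{K}{n}$ with $c_n = \log \frac{n |\mathcal{X}|^D}{d_{FB}\,\epsilon}$

2) $R_{emp}$ is adaptively achievable up to $\delta_n = 3\sqrt{\frac{\log q_{min}^{-1} \cdot \left(c_n + (2d_{FB} - 1) \cdot \log q_{min}^{-1}\right)}{n}}$

3) $R_{emp}$ is asymptotically adaptively achievable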



a 



Note: as in Corollary 7.2 for small e and large n, 6n 
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Proof: What we will actually prove is the attainability of the rate function 

$$R_{emp}' = \frac{1}{n} \log \psi_0^n = \frac{1}{n} \log \frac{P(\mathbf{x}^{n-D}|\mathbf{y})}{Q(\mathbf{x})} \tag{141}$$

Since $P(\mathbf{x}^n|\mathbf{y}) = P(\mathbf{x}^{n-D}|\mathbf{y}) \cdot P(x_{n-D+1}^n|\mathbf{y}, \mathbf{x}^{n-D}) \le P(\mathbf{x}^{n-D}|\mathbf{y})$, we have that $R_{emp} \le R_{emp}'$ and therefore the achievability of $R_{emp}'$ shows the achievability of $R_{emp}$. The adaptive achievability of the rate function above is given by Theorem 7 when the conditions hold. Below we prove the conditions hold:

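CCDF condition: we use the Markov sufficient condition. By plugging the second form in (138):

$$E\left[\psi(\mathbf{X}^k, \mathbf{y}^k, j) \,\middle|\, \mathbf{x}^j\right] = \sum_{x_{j+1}^k} \psi(\mathbf{x}^k, \mathbf{y}^k, j) \cdot Q(x_{j+1}^k|\mathbf{x}^j) = \sum_{x_{j+1}^k \in \mathcal{X}^{k-j}} P(x_{j-D+1}^{k-D}|\mathbf{x}^{j-D}, \mathbf{y}^k) \tag{142}$$

If $k - j \ge D$, then we continue as follows

$$= \sum_{x_{k-D+1}^k \in \mathcal{X}^D} \; \sum_{x_{j+1}^{k-D}} P(x_{j-D+1}^{k-D}|\mathbf{x}^{j-D}, \mathbf{y}^k) \le \sum_{x_{k-D+1}^k \in \mathcal{X}^D} 1 = |\mathcal{X}|^D \tag{143}$$

If $k - j < D$ then the same bound holds based on (142) (the number of elements in the sum is at most $|\mathcal{X}|^D$). Therefore the condition is satisfied with $L_{k-j} = |\mathcal{X}|^D$ and $b_0 = 0$ (i.e. holds for any value of $k - j$).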

Summability: Let $\{j_b, k_b\}_{b=1}^B$ be a set of segments as defined in the summability condition of Theorem 7 and let $\psi_b$ be as defined there. Let $A$ denote the set of indices not included in the segments, with $|A| = m_0$. Using the condition on the input we have, for every sequence $\mathbf{x}$ with non-zero probability:

$$\psi(\mathbf{x}^i, \mathbf{y}^i, i-1) = \frac{P(x_{i-D}|\mathbf{x}^{i-D-1}, \mathbf{y}^i)}{Q(x_i|\mathbf{x}^{i-1})} \le \frac{1}{q_{min}} \tag{144}$$

We recursively use (139) to write $\psi_0$ as a product of $\psi$ over the segments and the $\psi(\mathbf{x}^i, \mathbf{y}^i, i-1)$ over the un-included symbols.

$$\psi_0 = \prod_{b=1}^B \psi_b \cdot \prod_{i \in A} \psi(\mathbf{x}^i, \mathbf{y}^i, i-1) \le \prod_{b=1}^B \psi_b \cdot q_{min}^{-m_0} \tag{145}$$

Taking logarithm, we obtain the summability condition with $f_0^{(n)} = \log q_{min}^{-1}$:

$$\log \psi_0 - \sum_{b=1}^B \log \psi_b \le m_0 \cdot \underbrace{\log q_{min}^{-1}}_{f_0^{(n)}} \tag{146}$$

Property (1) in Theorem 8 is proven by directly plugging these values into Theorem 7 and using $R_{emp} \le R_{emp}'$. Furthermore, $R_{emp}'$ is upper bounded by $R_{emp}' \le \log q_{min}^{-1}$, due to the constraint on $Q$. This can be shown by definition but also derived from the summability condition with $B = 0$ and hence $m_0 = n$. Property (2) is shown by using Corollary 7.2 with $R_{max} = \log q_{min}^{-1}$. Property (3) can be shown by using Corollary 7.1 or either of the previous properties. □

Note that by (142) for the case $D = 0$, the CCDF condition holds also if the distribution is continuous (i.e. $P(\mathbf{x}|\mathbf{y})$, $Q(\mathbf{x})$ are density functions and are not upper bounded by 1), since the sum will be replaced by the integral of $P(x_{j+1}^k|\mathbf{y}^k, \mathbf{x}^j)$ over $x_{j+1}^k \in \mathcal{X}^{k-j}$, which is one. The summability condition may hold with a different $f_0$, if $P(x_i|\mathbf{x}^{i-1}, \mathbf{y})$ is bounded. Therefore in the general case if $P(\mathbf{x}|\mathbf{y})$ is strictly causal (with $D = 0$), $P(x_i|\mathbf{x}^{i-1}, \mathbf{y})$ is upper bounded, and $Q(x_i|\mathbf{x}^{i-1})$ is lower bounded, the $R_{emp}$ of (140) is asymptotically achievable.

It is worthwhile spending a few words on the limitation $Q(x_k|\mathbf{x}^{k-1}) \in \{0\} \cup [q_{min}, 1]$. This limitation relates to the summability condition, where $f_0$ reflects the loss due to the fact we do not have a constraint on all the symbols as expressed in (123). As an example, suppose that $d_{FB} = 1$. We know that one symbol before the end of a block in the scheme, $\psi < \psi^*$. In the next symbol, the metric exceeds the threshold, but we do not have a bound on how much it exceeds it, and this gap is expressed by a loss with respect to the ideal rate function. If we let one of the values of $x_k$ have a very low a-priori probability, and this symbol occurs and has a high a-posteriori probability $P(x_k|\mathbf{y}, \mathbf{x}^{k-1})$ after seeing $\mathbf{y}$, then the growth of the metric, $\frac{P(x_k|\mathbf{y}, \mathbf{x}^{k-1})}{Q(x_k|\mathbf{x}^{k-1})}$, may be unlimited. In [1] we did not use this constraint but as a result (of this and other technical reasons), had to define a set of sequences $\mathbf{x}$ for which the assertions do not hold. This is further discussed in Section . In the discrete case, this condition is plausible, and we can find a $q_{min}$ for any $Q$ since there is always a minimum value to the non-zero probabilities. We will also use a similar constraint, for similar reasons, for the continuous case. In this case, the condition changes the distribution, but it can be considered a replacement to a "failure set" as was defined in [1], and its purpose is to prevent symbols which have unlimited contribution to the rate function.

F. The ML rate function 



We have presented the ML rate function in Section VI-B:

$$R_{emp}^{ML} = \max_{\theta \in \Theta} \frac{1}{n} \log \frac{P_\theta(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})} = \frac{1}{n} \log \frac{P_{ML}(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})} \tag{147}$$

where $P_\theta(\mathbf{x}|\mathbf{y})$ is a family of conditional distributions, indexed by a parameter $\theta \in \Theta$, and $P_{ML}$ is the maximum likelihood conditional probability (54). Our purpose is now to attain this rate function adaptively, up to overhead terms. We will present some general cases in which it is possible to do so. Note that achieving $R_{emp}^{ML}$ adaptively also means achieving the rate function of (81) adaptively.

1) The discrete case based on a weight function: The first case of interest is when there exists a weighting function over $\Theta$, denoted $w(\theta)$, with $\int_\Theta w(\theta) d\theta = 1$, and a constant $C_n$ such that

$$P_{ML}(\mathbf{x}|\mathbf{y}) = \max_\theta P_\theta(\mathbf{x}|\mathbf{y}) \le C_n \cdot \int_\Theta w(\theta) P_\theta(\mathbf{x}|\mathbf{y}) d\theta \tag{148}$$

where the constant $C_n$ grows sub-exponentially with $n$. The term $\int_\Theta w(\theta) P_\theta(\mathbf{x}|\mathbf{y}) d\theta$ is sometimes termed the "Bayesian mixture" of the distributions $P_\theta$ with a prior $w(\theta)$. Mixtures of this type appear as solutions to the minimax redundancy problem [18] (sometimes termed "average regret"), seeking to minimize the maximum divergence $D(P^* \| P_\theta)$ between a universal distribution $P^*$ and the set of distributions $\{P_\theta\}$, while we require the relation in (148) to hold per point $(\mathbf{x}, \mathbf{y})$. Our target in (148) of upper bounding $P_{ML}(\mathbf{x}|\mathbf{y})$ is related to the problem of minimax regret [18] which was discussed in Section VI-B2, i.e. the problem of finding a distribution $P^*(\mathbf{x}|\mathbf{y})$ which is close to $P_{ML}(\mathbf{x}|\mathbf{y})$ in the sense of minimizing the maximum regret $\max_{\mathbf{x}} \log \frac{P_{ML}(\mathbf{x}|\mathbf{y})}{P^*(\mathbf{x}|\mathbf{y})}$. For the class of conditionally memoryless distributions it was observed by Xie and Barron [19] that the Dirichlet-$\frac{1}{2}$ Bayesian mixture, which is the asymptotically optimal solution to the minimax redundancy problem, also yields a nearly optimum maximum regret. See Section for further details.

Suppose for example that the number of $\theta$-s achieving the maximum is sub-exponential. An example of such a case is when $P_\theta(\mathbf{x}|\mathbf{y})$ is the memoryless distribution where $\theta$ is the single-letter conditional distribution. In this case the $\theta$ achieving the maximum is the empirical distribution (see Section VI-A3), and the number of empirical distributions (conditional types) is bounded by $(n+1)^{|\mathcal{X}| \cdot |\mathcal{Y}|}$. Similarly, if $P_\theta(\mathbf{x}|\mathbf{y})$ is defined by higher order conditional distributions, the number of maximizing $\theta$-s can be polynomially bounded. Denote by $\tilde{\Theta}$ the set of $\theta$-s that may achieve the maximum, and assume $|\tilde{\Theta}| \le C_n$. Then (148) holds in a straightforward way by defining a uniform $w(\theta)$ over $\tilde{\Theta}$, i.e. use a discrete weighting that gives a weight $\frac{1}{|\tilde{\Theta}|}$ for every $\theta \in \tilde{\Theta}$ and zero otherwise. In this case:

$$\max_{\theta \in \Theta} P_\theta(\mathbf{x}|\mathbf{y}) = \max_{\theta \in \tilde{\Theta}} P_\theta(\mathbf{x}|\mathbf{y}) \le \sum_{\theta \in \tilde{\Theta}} P_\theta(\mathbf{x}|\mathbf{y}) \le C_n \cdot \sum_{\theta \in \tilde{\Theta}} \underbrace{\frac{1}{|\tilde{\Theta}|}}_{w(\theta)} P_\theta(\mathbf{x}|\mathbf{y}) \tag{149}$$

However the number of maximizing $\theta$-s is a rather coarse bound, and a better bound may be obtained by assuming that $P_\theta(\mathbf{x}|\mathbf{y})$ is smooth in $\theta$, and therefore if the $\theta$ achieving the maximum in the LHS of (148) is $\theta^*$, the integral on the RHS includes a volume surrounding $\theta^*$ in which $P_\theta(\mathbf{x}|\mathbf{y})$ is close to $P_{\theta^*}(\mathbf{x}|\mathbf{y})$. Therefore, the integral in the RHS contains an integral over this volume of $w(\theta)$, rather than just the contribution of $w(\theta^*)$, and as a result the integral is larger than the simplistic bound of (149), and the coefficient $C_n$ can be reduced.

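Supposing (148) is satisfied, and assuming $P_\theta(\mathbf{x}|\mathbf{y})$ is D-causal, then it is easy to see that the weighted distribution

$$P_w(\mathbf{x}|\mathbf{y}) = \int_\Theta w(\theta) P_\theta(\mathbf{x}|\mathbf{y}) d\theta \tag{150}$$

is also D-causal, since $P_w(\mathbf{x}^k|\mathbf{y})$ is only a function of $\mathbf{y}^{k+D}$:

$$P_w(\mathbf{x}^k|\mathbf{y}) = \sum_{x_{k+1}^n} P_w(\mathbf{x}|\mathbf{y}) = \int_\Theta w(\theta) \sum_{x_{k+1}^n} P_\theta(\mathbf{x}|\mathbf{y}) d\theta = \int_\Theta w(\theta) P_\theta(\mathbf{x}^k|\mathbf{y}) d\theta = \int_\Theta w(\theta) P_\theta(\mathbf{x}^k|\mathbf{y}^{k+D}) d\theta \tag{151}$$

It is interesting to note that the fact $P_\theta(\mathbf{x}|\mathbf{y})$ is D-causal does not mean $P_{ML}(\mathbf{x}|\mathbf{y})$ is D-causal (only that $P_w$ is, and $P_w$ can be used to upper bound $P_{ML}$). Therefore we have that

$$R_{emp}^{ML} = \frac{1}{n} \log \frac{P_{ML}(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})} \le \frac{1}{n} \log \frac{C_n P_w(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})} = \frac{1}{n} \log \frac{P_w(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})} + \frac{\log C_n}{n} \tag{152}$$

If the conditions of Theorem 8 hold with respect to $P_\theta$ and $Q$, then $\frac{1}{n} \log \frac{P_w(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})}$ is adaptively achievable up to the factors given by Theorem 8 and therefore $R_{emp}^{ML}$ will be achievable up to these factors plus $\frac{\log C_n}{n}$ (which tends to zero with $n$ if $C_n$ is sub-exponential). The conclusions from this discussion are formalized in the following theorem: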



Theorem 9. Let $P_\theta(\mathbf{x}|\mathbf{y})$ be a family of conditional distributions, indexed by a parameter $\theta \in \Theta$, and $P_{ML}$ be the maximum likelihood conditional probability (54). If the conditions of Theorem 8 hold with respect to $P_\theta$ and $Q$, and (148) holds, then $R_{emp}^{ML} = \frac{1}{n} \log \frac{P_{ML}(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})}$ defined in (147) is adaptively achievable up to $\delta_n^{ML} = \delta_n + \frac{\log C_n}{n}$, where $\delta_n$ is defined in Theorem 8. If further $\frac{\log C_n}{n} \to 0$, then $R_{emp}^{ML}$ is asymptotically adaptively achievable.



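Note: if (148) is satisfied then the intrinsic redundancy of $R_{emp}^{ML}$ satisfies

$$\mu_Q(R_{emp}^{ML}) \le \frac{1}{n} \log E_Q\left[\exp\left(n R_{emp}^{ML}(\mathbf{X}, \mathbf{y})\right)\right] = \frac{1}{n} \log E_Q\left[\frac{P_{ML}(\mathbf{X}|\mathbf{y})}{Q(\mathbf{X})}\right] \le \frac{1}{n} \log \left(C_n \int_\Theta w(\theta) E_Q\left[\frac{P_\theta(\mathbf{X}|\mathbf{y})}{Q(\mathbf{X})}\right] d\theta\right) = \frac{1}{n} \log C_n \tag{153}$$

Therefore it is achievable (non adaptively) up to $\frac{1}{n} \log \frac{1}{\epsilon} + \frac{\log C_n}{n}$. The last term, which is related to the complexity of the parametric class, is common to the adaptive and non-adaptive case. The first term increases from $\frac{1}{n} \log \frac{1}{\epsilon}$ in the non-adaptive case to $\delta_n$ of Theorem 8 in the adaptive case. I.e. the penalty paid for the error probability increases by a square root, and an additional redundancy of $O\left(\sqrt{\frac{\log n}{n}}\right)$ is added. In many cases $\frac{\log C_n}{n}$ decays to 0 like $O\left(\frac{\log n}{n}\right)$, i.e. faster than $\delta_n$, and therefore the main overhead is due to the rate adaptivity scheme, and not to the complexity of the class.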

2) The conditionally memoryless discrete case: In Theorem 9 we characterized the redundancy achieved by the adaptivity scheme using the factor $C_n$ from (148). The additional redundancy related to the parametric class according to Theorem 9 is $\frac{\log C_n}{n}$. We now give an expression for $C_n$ for the conditionally memoryless case, based on known results on minimax regret.

Let $\mathbf{z} \in \mathcal{Z}^n$ be a vector of states, which may have an arbitrary dependence on $\mathbf{x}$ and $\mathbf{y}$. In the simplest case $\mathbf{z} = \mathbf{y}$. Our parameter class $\Theta$ is the class of memoryless conditional distributions of $x$ given $z$, defined by the conditional probability function $\theta(x|z)$ with $x \in \mathcal{X}$, $z \in \mathcal{Z}$ and where $\sum_{x \in \mathcal{X}} \theta(x|z) = 1$. The probability of $\mathbf{x}$ is:

$$P_\theta(\mathbf{x}|\mathbf{y}) = \prod_{i=1}^n \theta(x_i|z_i) \tag{154}$$

The functional dependence of $\mathbf{z}$ on $\mathbf{x}, \mathbf{y}$ is implicit in (154), i.e. for any value of $\mathbf{x}, \mathbf{y}$ we first calculate the vector $\mathbf{z}$ and apply it to (154). In order for $P_\theta(\mathbf{x}|\mathbf{y})$ to be a probability (i.e. sum to unity over $\mathbf{x} \in \mathcal{X}^n$), we need to assume that $x_i$ does not affect $z_i$. Specifically, we restrict $z_i$ to depend only on the past of $\mathbf{x}$, i.e. on $\mathbf{x}^{i-1}$, and the entire $\mathbf{y}$. In this case it is easy to see that (154) defines a legitimate probability (by summing first on $x_n$ and then on $x_{n-1}$, etc). This distribution was discussed in Section VI-A3 where it was shown that the maximum likelihood solution is the empirical conditional probability, and therefore the maximum likelihood probability $P_{ML}$ is the empirical conditional probability.

Since all $|\mathcal{X}| \cdot |\mathcal{Z}|$ elements of the conditional probability vectors are in $\left\{\frac{i}{n}\right\}_{i=0}^n$, the maximum always occurs within a limited set $\tilde{\Theta}$ of at most $|\tilde{\Theta}| \le (n+1)^{|\mathcal{X}| \cdot |\mathcal{Z}|}$ values, therefore as already mentioned in Section VII-F1 a coarse bound on $C_n$ is $(n+1)^{|\mathcal{X}| \cdot |\mathcal{Z}|}$, which yields a redundancy of $\frac{\log C_n}{n} \approx |\mathcal{X}| \cdot |\mathcal{Z}| \frac{\log n}{n}$.
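To give a feel for the magnitude of this coarse overhead, the following sketch evaluates $\frac{\log C_n}{n}$ with $C_n = (n+1)^{|\mathcal{X}| \cdot |\mathcal{Z}|}$. The function name and the alphabet sizes are hypothetical, and natural logarithms are assumed (so the result is in nats per symbol).

```python
import math


def type_count_redundancy(n, x_size, z_size):
    """Overhead (1/n) * log C_n for the coarse bound C_n = (n+1)^(|X|*|Z|), in nats."""
    return x_size * z_size * math.log(n + 1) / n


# Hypothetical example: binary input and binary state
for n in (100, 1000, 10000):
    print(n, type_count_redundancy(n, x_size=2, z_size=2))
```

The overhead vanishes as $n$ grows, at the rate $\frac{\log n}{n}$ stated above.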

Xie and Barron [19] gave asymptotically tight expressions for the maximum regret associated with Bayesian mixtures in the memoryless case. We first state their results for the non-conditional case $\mathcal{Z} = \emptyset$. Define $P_{ML}(\mathbf{x}) = \max_{\theta(\cdot)} P_\theta(\mathbf{x}) = \max_{\theta(\cdot)} \prod_{i=1}^n \theta(x_i)$, and $P_w(\mathbf{x}) = \int_\Theta P_\theta(\mathbf{x}) w(\theta) d\theta$. Their Lemma 1 states that when using the Dirichlet-$\frac{1}{2}$ prior for $w(\theta)$, i.e. $w(\theta) = \frac{1}{c \sqrt{\prod_{x \in \mathcal{X}} \theta(x)}}$ (where $c$ is the normalizing factor), the regret satisfies (see Equation (23) for an explicit bound):

$$\log \frac{P_{ML}(\mathbf{x})}{P_w(\mathbf{x})} \le \frac{d}{2} \log \frac{n}{2\pi} + C_{\mathcal{X}} + \frac{|\mathcal{X}|}{2} \log e + o_n(1) \tag{155}$$

where $d = |\mathcal{X}| - 1$ is the number of free parameters, $o_n(1) \to 0$, and

$$C_{\mathcal{X}} = \log \frac{\Gamma\left(\frac{1}{2}\right)^{|\mathcal{X}|}}{\Gamma\left(\frac{|\mathcal{X}|}{2}\right)} \tag{156}$$

This observation is attributed to Shtarkov [20] but was given a more explicit expression by Xie and Barron (see also Cover and Thomas [14, Section 13.2], Cesa-Bianchi and Lugosi (Remark 9.4)).

Furthermore, they propose a slightly modified distribution $w(\theta)$ for which (Theorem 2):

$$\log \frac{P_{ML}(\mathbf{x})}{P_w(\mathbf{x})} \le \frac{d}{2} \log \frac{n}{2\pi} + C_{\mathcal{X}} + o_n(1) \tag{157}$$

The term on the RHS of (157) tends to the asymptotic minimax regret (i.e. the regret achieved by the NML), i.e. this weighting scheme asymptotically loses nothing with respect to the optimum regret. Note that the expressions in (155) and (157) both share the common factor $\frac{d}{2} \log n$, and the difference is an increase of the constant factor by $\frac{|\mathcal{X}|}{2} \log e$ in the Dirichlet prior with respect to the minimax solution and the prior proposed by Xie and Barron. This weighting has the property that it depends on $n$, whereas the former Dirichlet mixture does not. They extend their results to the conditional case (see Section IX there), however the proofs are quite involved.
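For intuition on the $\frac{d}{2} \log n$ growth of the regret, the sketch below computes the exact maximum regret of the Dirichlet-$\frac{1}{2}$ mixture in the binary, non-conditional case, where this mixture is the Krichevsky-Trofimov estimator. It compares the result with the classical bound $\frac{1}{2} \ln n + \ln 2$ for that estimator, which is a standard result and not a formula from this paper. Function names are hypothetical and natural logarithms are used.

```python
import math


def mixture_log_prob(n0, n1):
    """Log-probability (nats) that the Dirichlet-1/2 mixture assigns to a binary
    sequence with n0 zeros and n1 ones: Gamma(n0+1/2)Gamma(n1+1/2) / (pi * n!)."""
    n = n0 + n1
    return (math.lgamma(n0 + 0.5) + math.lgamma(n1 + 0.5)
            - math.lgamma(n + 1) - math.log(math.pi))


def ml_log_prob(n0, n1):
    """Log of the maximum likelihood probability (empirical distribution), in nats."""
    n = n0 + n1
    return sum(c * math.log(c / n) for c in (n0, n1) if c > 0)


def max_regret(n):
    """Maximum over all binary sequences of length n of log P_ML / P_w."""
    return max(ml_log_prob(k, n - k) - mixture_log_prob(k, n - k)
               for k in range(n + 1))


for n in (10, 100, 1000):
    print(n, max_regret(n), 0.5 * math.log(n) + math.log(2))
```

The regret depends on the sequence only through its counts, so the maximum can be taken over the $n + 1$ possible compositions rather than over all $2^n$ sequences.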

Below we show how the result regarding the Dirichlet prior is extended to the conditional case. Although this extension is quite standard (and sub-optimal compared to Xie and Barron's extension), we present it explicitly here in order to show that the dependence between $\mathbf{x}$, $\mathbf{y}$ and $\mathbf{z}$ does not change the result. The parameters are now the set of $|\mathcal{X}| \cdot |\mathcal{Z}|$ values of the function $\theta(x|z)$ which have $(|\mathcal{X}| - 1) \cdot |\mathcal{Z}|$ degrees of freedom (since $\forall z : \sum_x \theta(x|z) = 1$). The prior is simply the product of Dirichlet priors assigned to each function $\theta(\cdot|z)$ for each value of $z$, i.e. for a probability vector $\theta = \theta(\cdot|z)$ let $w_0(\theta) = \frac{1}{c \sqrt{\prod_{x \in \mathcal{X}} \theta(x)}}$, then the weight function is $w(\theta) = \prod_z w_0(\theta(\cdot|z))$.

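For each $z$, consider the sub-vector of $\mathbf{x}$ at the indices where $z_i = z$. The result is based on the fact that the parameters for each of the sub-vectors are separate, and therefore the problem can be reduced to the non-conditional case. In the maximum likelihood solution, each sub-vector has a set of variables independent of the other sub-vectors and therefore the maximum likelihood probability of the sub-vector depends only on the empirical distribution of $\mathbf{x}$ over the sub-vector. For the mixture distribution, the dependence on $\theta(\cdot|z)$ stems only from the elements of the sub-vector associated with $z$ and the integral can be separated into a set of weighted distributions on the sub-vectors, which are related to the maximum likelihood probabilities. The regret terms for each of the sub-vectors are accumulated, and bounded by a convexity argument. Rewrite the RHS of (155) as $c_1 \log n + c_2$ to express explicitly the dependence on $n$, and let $\hat{P}_{\mathbf{z}}(z)$ denote the empirical distribution of $\mathbf{z}$ (so the sub-vector associated with $z$ has length $n \hat{P}_{\mathbf{z}}(z)$), then:

$$\begin{aligned}
\log P_w(\mathbf{x}|\mathbf{y}) &= \log \int w(\theta) \prod_{i=1}^n \theta(x_i|z_i) d\theta \\
&= \log \prod_{z \in \mathcal{Z}} \int w_0(\theta(\cdot|z)) \cdot \prod_{i : z_i = z} \theta(x_i|z) \, d\theta(\cdot|z) \\
&= \sum_{z \in \mathcal{Z}} \log \int w_0(\theta(\cdot|z)) \cdot \prod_{i : z_i = z} \theta(x_i|z) \, d\theta(\cdot|z) \\
&\ge \sum_{z \in \mathcal{Z}} \left[\log \max_{\theta(\cdot|z)} \prod_{i : z_i = z} \theta(x_i|z) - \left(c_1 \log\left[n \hat{P}_{\mathbf{z}}(z)\right] + c_2\right)\right] \\
&= \log P_{ML}(\mathbf{x}|\mathbf{y}) - \sum_{z \in \mathcal{Z}} \left[c_1 \log\left(n \hat{P}_{\mathbf{z}}(z)\right) + c_2\right] \\
&= \log P_{ML}(\mathbf{x}|\mathbf{y}) - |\mathcal{Z}| \cdot c_2 - |\mathcal{Z}| \cdot c_1 \cdot \sum_{z \in \mathcal{Z}} \frac{1}{|\mathcal{Z}|} \log\left(n \hat{P}_{\mathbf{z}}(z)\right) \\
&\overset{\text{Convexity}}{\ge} \log P_{ML}(\mathbf{x}|\mathbf{y}) - |\mathcal{Z}| \cdot c_2 - |\mathcal{Z}| \cdot c_1 \cdot \log\left(\sum_{z \in \mathcal{Z}} \frac{1}{|\mathcal{Z}|} n \hat{P}_{\mathbf{z}}(z)\right) \\
&= \log P_{ML}(\mathbf{x}|\mathbf{y}) - |\mathcal{Z}| \left[c_1 \cdot \log\left(\frac{n}{|\mathcal{Z}|}\right) + c_2\right]
\end{aligned} \tag{158}$$

Therefore the regret becomes

$$r_n = |\mathcal{Z}| \left[c_1 \cdot \log\left(\frac{n}{|\mathcal{Z}|}\right) + c_2\right] = |\mathcal{Z}| \left[\frac{|\mathcal{X}| - 1}{2} \log \frac{n}{2\pi |\mathcal{Z}|} + C_{\mathcal{X}} + \frac{|\mathcal{X}|}{2} \log e + o_n(1)\right] \tag{159}$$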



It is important to note that $\mathbf{x}$, $\mathbf{y}$ and $\mathbf{z}$ are all constant throughout (158), and therefore the result is oblivious to any dependence between them. The modification of Xie and Barron's asymptotically optimal result to the conditional case results in a similar expression, i.e. $|\mathcal{Z}| \left[c_1 \cdot \log\left(\frac{n}{|\mathcal{Z}|}\right) + c_2\right]$ where $c_1, c_2$ are taken from (157). One way or the other, we have obtained a relation of the form:

$$\log \frac{P_{ML}(\mathbf{x}|\mathbf{y})}{P_w(\mathbf{x}|\mathbf{y})} \le r_n \tag{160}$$

I.e. (148) holds with $C_n = \exp(r_n)$. Therefore the redundancy term $\frac{\log C_n}{n}$ of Theorem 9 is

$$\frac{\log C_n}{n} = \frac{r_n}{n} \tag{161}$$

We summarize these results in the following theorem, which specializes Theorem 9 to the case of conditional memoryless distributions.

Theorem 10. Let $\mathbf{z} \in \mathcal{Z}^n$ be a discrete vector of states, which is a function of $\mathbf{x}$ and $\mathbf{y}$, where $z_i$ may arbitrarily depend on $\mathbf{x}^{i-1}$ and $\mathbf{y}^{i+D}$ for some delay $D \ge 0$. Let $Q(\mathbf{x})$ be an input distribution over a discrete set $\mathcal{X}$ that satisfies $\forall k : Q(x_k|\mathbf{x}^{k-1}) \in \{0\} \cup [q_{min}, 1]$. Define the following rate function:

$$R_{emp} = \frac{1}{n} \log \frac{\hat{P}(\mathbf{x}|\mathbf{z})}{Q(\mathbf{x})} \tag{162}$$

$R_{emp}$ is adaptively achievable up to $\delta_n^{ML} = \delta_n + \frac{1}{n} r_n$, where $\delta_n$ is defined in Theorem 8 and

$$r_n = |\mathcal{Z}| \left[\frac{|\mathcal{X}| - 1}{2} \log \frac{n}{2\pi |\mathcal{Z}|} + C_{\mathcal{X}} + \frac{|\mathcal{X}|}{2} \log e + o_n(1)\right] \tag{163}$$

with $C_{\mathcal{X}}$ defined in (156) and $o_n(1) \to 0$. Furthermore, this rate function has intrinsic redundancy $\mu_Q \le \frac{1}{n} r_n$ and is achievable non adaptively, up to $\frac{1}{n}\left(r_n + \log \epsilon^{-1}\right)$.

Proof: based on Theorem 9 and the discussion above. Note that for the conditionally memoryless class we defined, $P_{ML}(\mathbf{x}|\mathbf{y}) = \hat{P}(\mathbf{x}|\mathbf{z})$. Theorem 9 requires that the conditions of Theorem 8 be satisfied with respect to $P_\theta(\mathbf{x}|\mathbf{y})$ and $Q$. Specifically, $Q$ needs to be bounded from below, and $P_\theta(\mathbf{x}|\mathbf{y})$ of (154) is required to be D-causal, which is obtained by allowing $z_i$ to depend only on the past of $\mathbf{x}$ and $D$ future samples of $\mathbf{y}$. In this case, in the conditionally memoryless model of (154), $P_\theta(x_i|\mathbf{x}^{i-1}, \mathbf{y}) = P_\theta(x_i|\mathbf{x}^{i-1}, \mathbf{y}^{i+D}) = \theta(x_i|z_i(\mathbf{x}^{i-1}, \mathbf{y}^{i+D}))$, since $\mathbf{x}^{i-1}, \mathbf{y}^{i+D}$ completely define $z_i$ (see Definition 9).

The result in the non-adaptive case and the bound on the intrinsic redundancy follow from Lemma 3, since by (160) we can write $R_{emp} \le \frac{1}{n} \log \frac{P_w(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})} + \frac{r_n}{n}$, and by Lemma 3 the first part has $\mu_Q \le 0$ (see also the note following Theorem 9). □

Note that the additional redundancy $\frac{r_n}{n} \approx \frac{|\mathcal{Z}| \cdot (|\mathcal{X}| - 1)}{2} \cdot \frac{\log n}{n}$ is better than the redundancy $\approx |\mathcal{X}| \cdot |\mathcal{Z}| \cdot \frac{\log n}{n}$ which is obtained using the simple bound based on the number of types (Theorem 6).

3) The continuous and general case: The discussion above was relevant for the discrete case only. For the continuous (or 
general) case, we need to consider additional constraints. We define the following decoding metric: 

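$$\psi(\mathbf{x}^i, \mathbf{y}^i, j) = \left( \frac{\max_\theta P_\theta(\mathbf{x}_{j+1}^{i} \mid \mathbf{y}^i, \mathbf{x}^j)}{Q(\mathbf{x}_{j+1}^{i} \mid \mathbf{x}^j)} \right)^{\gamma} \qquad (164)$$

where $\gamma \in (0, 1)$ and we assume $P_\theta$ is strictly causal (with $D = 0$). When this decoding metric meets the conditions of Theorem 7, the resulting rate function would be

$$R_{emp} = \frac{1}{n} \log \psi(\mathbf{x}, \mathbf{y}, 0) = \frac{1}{n} \cdot \gamma \cdot \log \frac{p_{ML}(\mathbf{x} \mid \mathbf{y})}{Q(\mathbf{x})} = \gamma R^{ML}_{emp} \qquad (165)$$

i.e. it achieves $R^{ML}_{emp}$ up to a multiplicative factor, which we would like to take to 1 as $n \to \infty$.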

We now analyze the conditions required for $\psi$. Unlike the discrete case in which we could easily characterize a set of
rate functions which can be adaptively achieved by the scheme presented, in the general case we do not have such a simple 
characterization. Instead, we give below some analysis of the conditions. 

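The Markov sufficient condition for the CCDF requires bounding the following quantity:

$$E\left[\psi(\mathbf{x}^i, \mathbf{y}^i, j) \mid \mathbf{x}^j\right] = \int \left( \frac{\max_\theta P_\theta(\mathbf{x}_{j+1}^{i} \mid \mathbf{y}^i, \mathbf{x}^j)}{Q(\mathbf{x}_{j+1}^{i} \mid \mathbf{x}^j)} \right)^{\gamma} Q(\mathbf{x}_{j+1}^{i} \mid \mathbf{x}^j) \, d\mathbf{x}_{j+1}^{i} = \int p_{ML}^{\gamma}(\mathbf{x}_{j+1}^{i} \mid \mathbf{y}^i, \mathbf{x}^j) \, Q^{1-\gamma}(\mathbf{x}_{j+1}^{i} \mid \mathbf{x}^j) \, d\mathbf{x}_{j+1}^{i} \qquad (166)$$

Note that the same applies for discrete $\mathbf{x}$, replacing the integral with a sum. For $\gamma = 0$ the value above is simply the integral of $Q$ and is therefore 1 (and bounded), and therefore it is reasonable to assume that there exists a $0 < \gamma < 1$ for which the integral above is bounded. For $\gamma = 1$ the above evaluates to the redundancy term in universal coding (see Section VI-B); however, this term may be infinite when the distribution is continuous.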
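The role of $\gamma$ in (166) can be illustrated on a toy case that is not taken from the paper: a single symbol, prior $Q = \mathcal{N}(0,1)$, and a unit-variance Gaussian location family, for which $p_{ML}(x) = \sup_\mu \mathcal{N}(x;\mu,1) = (2\pi)^{-1/2}$ is constant. Under these assumptions the integral equals $1/\sqrt{1-\gamma}$: it is 1 at $\gamma = 0$, finite for every $\gamma < 1$, and diverges as $\gamma \to 1$. The sketch below checks this by numerical integration; `ccdf_integral` is a hypothetical helper name.

```python
import math


def ccdf_integral(gamma, half_width=60.0, steps=120_000):
    """Trapezoidal estimate of the integral of p_ML(x)^gamma * Q(x)^(1-gamma).

    Toy assumptions: Q = N(0, 1) and p_ML(x) = 1/sqrt(2*pi) for all x
    (unit-variance Gaussian location family, single symbol).
    """
    log_p_ml = -0.5 * math.log(2.0 * math.pi)
    dx = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        x = -half_width + k * dx
        log_q = -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)
        weight = 0.5 if k in (0, steps) else 1.0
        # Work in the log domain so that Q(x)^(1-gamma) does not underflow.
        total += weight * math.exp(gamma * log_p_ml + (1.0 - gamma) * log_q)
    return total * dx


if __name__ == "__main__":
    for g in (0.0, 0.5, 0.75):
        print(g, ccdf_integral(g), 1.0 / math.sqrt(1.0 - g))
```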

The summability condition can be written as follows. Suppose that $\theta^* = \arg\max_\theta P_\theta(\mathbf{x} \mid \mathbf{y})$, and as in the proof of Theorem 8 let $\{j_b, k_b\}_{b=1}^{B}$ be a set of segments as defined in the summability condition of Theorem 7, and let $A$ denote the set of indices not included in the segments (unconstrained symbols), with $|A| = m_0$.

We assume that $Q(x_i \mid \mathbf{x}^{i-1})$ is bounded from two sides, i.e. $0 < q_{\min} \le Q(x_i \mid \mathbf{x}^{i-1}) \le q_{\max} < \infty$. For many distributions of interest (such as the Gaussian distribution), the lower bound $q_{\min}$ does not exist, and we need to "enforce" it by removing the tail of the distribution. In the current scheme it seems there is no way around this, since the scheme fails to attain $R_{emp}$ if an unconstrained symbol appears which has a very small a-priori probability $Q$ while its a-posteriori probability (which is controlled by the channel) is not small: such a symbol may increase $R_{emp}$ by an unbounded amount, which is not utilized by the scheme.
We may expand the probability $P_{\theta^*}(\mathbf{x} \mid \mathbf{y})$ by Bayes' law:

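$$P_{\theta^*}(\mathbf{x} \mid \mathbf{y}) = \prod_{b=1}^{B} P_{\theta^*}(\mathbf{x}_{j_b+1}^{k_b} \mid \mathbf{y}, \mathbf{x}^{j_b}) \cdot \prod_{i \in A} P_{\theta^*}(x_i \mid \mathbf{y}, \mathbf{x}^{i-1}) \qquad (167)$$

and similarly for $Q$:

$$Q(\mathbf{x}) = \prod_{b=1}^{B} Q(\mathbf{x}_{j_b+1}^{k_b} \mid \mathbf{x}^{j_b}) \cdot \prod_{i \in A} Q(x_i \mid \mathbf{x}^{i-1}) \qquad (168)$$

The terms in the first product in (167) are bounded by the maximum likelihood value over the segment:

$$P_{\theta^*}(\mathbf{x}_{j_b+1}^{k_b} \mid \mathbf{y}, \mathbf{x}^{j_b}) \overset{\text{causality}}{=} P_{\theta^*}(\mathbf{x}_{j_b+1}^{k_b} \mid \mathbf{y}^{k_b}, \mathbf{x}^{j_b}) \le \max_\theta P_\theta(\mathbf{x}_{j_b+1}^{k_b} \mid \mathbf{y}^{k_b}, \mathbf{x}^{j_b}) \qquad (169)$$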

The second product in (167) relates to the "unconstrained" symbols (see the proof of Theorem 7). Regarding the terms in this product, $P_{\theta^*}(x_i \mid \mathbf{y}, \mathbf{x}^{i-1})$, we do not have a general bound, and they may be bounded only in specific cases.

One simple case is when $P_\theta$ is globally upper bounded (i.e. $\forall \theta, \mathbf{x}, \mathbf{y}, i: P_\theta(x_i \mid \mathbf{y}, \mathbf{x}^{i-1}) \le c$); however, this is a rare case, since if, for example, the parameter space enables scaling of $P_\theta$ and this scaling is not bounded, then it is possible to obtain unlimited values of $P_\theta(x_i \mid \mathbf{y}, \mathbf{x}^{i-1})$ by scaling. If $s$ denotes the shrinkage ratio between $\theta'$ and $\theta$ (applied, for example, to both $\mathbf{x}$ and $\mathbf{y}$), then $P_{\theta'}(x_i \mid \mathbf{y}, \mathbf{x}^{i-1}) = s \cdot P_\theta(s \cdot x_i \mid s \cdot \mathbf{y}, s \cdot \mathbf{x}^{i-1})$, and we may obtain unbounded values by taking $s \to \infty$. As an example this occurs in the Gaussian case (see Section ) where the parameter $\theta$ is the covariance matrix.

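A softer requirement is that the probability $P_\theta$ be bounded per value of $\theta$: $\forall \theta, \mathbf{x}, \mathbf{y}, i: P_\theta(x_i \mid \mathbf{y}, \mathbf{x}^{i-1}) \le P_{\max}(\theta)$. In this case we may use the fact that the gap in the summability condition depends on the value of $\psi_0$. In many cases we can draw a bound on $P_{\theta^*}(x_i \mid \mathbf{y}, \mathbf{x}^{i-1})$ from the knowledge of $p_{ML}(\mathbf{x} \mid \mathbf{y})$. The reason is that $\theta^* = \hat{\theta}_{ML}(\mathbf{x} \mid \mathbf{y})$ maximizes the product of all $P_\theta(x_i \mid \mathbf{y}, \mathbf{x}^{i-1})$ (for $i = 1, \ldots, n$), and therefore for many "smooth" distributions $\theta^*$ strikes a balance between the probabilities assigned to each symbol. In these cases the probability that any specific symbol may attain, while the total probability is bounded, cannot grow indefinitely. Specifically, in some cases of interest, including the Gaussian case, knowledge of $p_{ML}(\mathbf{x} \mid \mathbf{y})$ yields information on $\theta^*$, which can be used to upper-bound $P_{\theta^*}(x_i \mid \mathbf{y}, \mathbf{x}^{i-1})$, i.e. let

$$\Theta^{(ML)}(t) \equiv \left\{ \hat{\theta}_{ML}(\mathbf{x} \mid \mathbf{y}) : p_{ML}(\mathbf{x} \mid \mathbf{y}) \le t \right\} \qquad (170)$$

I.e. $\Theta^{(ML)}(t)$ is the range of possible values of the maximum likelihood estimator (over all $\mathbf{x}, \mathbf{y}$), for which the maximum likelihood probability is no more than $t$. For example in the Gaussian case . Now, since

$$p_{ML}(\mathbf{x} \mid \mathbf{y}) = Q(\mathbf{x}) \cdot \psi_0^{1/\gamma} \le q_{\max}^{n} \cdot \psi_0^{1/\gamma} \qquad (171)$$

and recall that $\theta^* = \hat{\theta}_{ML}(\mathbf{x} \mid \mathbf{y})$, we may bound $P_{\theta^*}$ as:

$$P_{\theta^*}(x_i \mid \mathbf{y}, \mathbf{x}^{i-1}) \le \max_{\theta \in \Theta^{(ML)}\left(q_{\max}^{n} \cdot \psi_0^{1/\gamma}\right)} P_{\max}(\theta) \equiv g_0(\psi_0) \qquad (172)$$

In other words, from $\psi_0$ we bound $p_{ML}(\mathbf{x} \mid \mathbf{y})$, obtain a range of possible $\theta^*$-s and find the maximum single-symbol probability that may be assigned using these $\theta^*$-s. This bounding technique can be better understood by reviewing the example of the Gaussian case which is .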

We summarize these conclusions in the following lemma: 

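Lemma 7. Let $\psi(\mathbf{x}^i, \mathbf{y}^i, j)$ be defined in (164) where $\gamma \in (0, 1)$ and we assume $P_\theta$ is strictly causal, and $Q(\mathbf{x})$ is bounded by $Q(x_i \mid \mathbf{x}^{i-1}) \in \{0\} \cup [q_{\min}, q_{\max}]$ (where $0 < q_{\min} \le q_{\max} < \infty$). Let $\Theta^{(ML)}(t) \equiv \{\hat{\theta}_{ML}(\mathbf{x} \mid \mathbf{y}) : p_{ML}(\mathbf{x} \mid \mathbf{y}) \le t\}$. If there exists $P_{\max}(\theta)$ such that $\forall \theta, \mathbf{x}, \mathbf{y}, i: P_\theta(x_i \mid \mathbf{y}, \mathbf{x}^{i-1}) \le P_{\max}(\theta)$, and $g_0(\psi_0) \equiv \max_{\theta \in \Theta^{(ML)}(q_{\max}^{n} \cdot \psi_0^{1/\gamma})} P_{\max}(\theta) < \infty$, then the summability condition in Theorem 7 holds with $f_0(\psi_0) = \gamma \cdot \log\left(g_0(\psi_0) \cdot q_{\min}^{-1}\right)$.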

Unfortunately, in the general case, the probability of a single symbol $P_{\theta^*}(x_i \mid \mathbf{y}, \mathbf{x}^{i-1})$ cannot be upper bounded even when $\hat{\theta}_{ML}(\mathbf{x} \mid \mathbf{y})$ is known (see Example ). In this case the summability condition does not hold, and we cannot attain the rate function $R^{ML}_{emp}$ using the scheme proposed here. The failure occurs with respect to the "unconstrained" symbols ($m_0$) in the summability condition. These symbols are related to the increase of the rate function at the symbol in which the decoding occurred. Therefore one might say that the failure to obtain the rate function in these cases stems from the scheme and the fact that it does not "use" all the symbols. On the other hand, it is quite difficult to envision an adaptive scheme that does not have this limitation. If the rate is determined by negotiation between the encoder and the decoder, then an unlimited increase of the rate function $R^{ML}_{emp}$ that occurs at the $n$-th symbol does not allow the system to adapt its rate (since the feedback for this symbol is not relevant). It is worth noting that in the posterior matching scheme [21] for the known memoryless channel (an extension of Horstein's scheme [22]), the rate for a given error probability $\epsilon$ can be determined by the decoder after reception (without coordination with the encoder, who always transmits the infinite sequence); however, it is not trivial to extend this scheme to the individual case.

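Assuming the above assumptions hold, we have:

$$\begin{aligned}
\frac{1}{\gamma} \log \psi_0 &= \log \frac{P_{\theta^*}(\mathbf{x} \mid \mathbf{y})}{Q(\mathbf{x})} \\
&= \sum_{b=1}^{B} \log \frac{P_{\theta^*}(\mathbf{x}_{j_b+1}^{k_b} \mid \mathbf{y}, \mathbf{x}^{j_b})}{Q(\mathbf{x}_{j_b+1}^{k_b} \mid \mathbf{x}^{j_b})} + \sum_{i \in A} \log \frac{P_{\theta^*}(x_i \mid \mathbf{y}, \mathbf{x}^{i-1})}{Q(x_i \mid \mathbf{x}^{i-1})} \\
&\overset{(169),(172)}{\le} \sum_{b=1}^{B} \log \frac{\max_\theta P_\theta(\mathbf{x}_{j_b+1}^{k_b} \mid \mathbf{y}^{k_b}, \mathbf{x}^{j_b})}{Q(\mathbf{x}_{j_b+1}^{k_b} \mid \mathbf{x}^{j_b})} + \sum_{i \in A} \log \frac{g_0(\psi_0)}{q_{\min}} \\
&= \frac{1}{\gamma} \sum_{b=1}^{B} \log \psi_b + m_0 \cdot \log\left(g_0(\psi_0) \cdot q_{\min}^{-1}\right)
\end{aligned} \qquad (173)$$

Therefore the summability condition holds:

$$\log \psi_0 - \sum_{b=1}^{B} \log \psi_b \le m_0 \cdot \gamma \cdot \log\left(g_0(\psi_0) \cdot q_{\min}^{-1}\right) \qquad (174)$$

with $f_0(\psi_0) = \gamma \cdot \log\left(g_0(\psi_0) \cdot q_{\min}^{-1}\right)$.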

To summarize, in the general case we do not have a general characterization of rate functions that are achieved by the scheme presented, and specifically there is no general claim that $R^{ML}_{emp}$ can be adaptively achieved. In specific cases, we may use the techniques shown here: the CCDF condition requires bounding the value in (166). For $R^{ML}_{emp}$, the summability condition holds in general with respect to the "constrained" segments, but particular treatment (per parametric family) is needed for the "unconstrained" symbols, possibly by Equations (170)-(172). Furthermore, to obtain these bounds we need to constrain $Q$ by a minimum and a maximum value.

4) Examples for the bound on the unconstrained symbols: Below we give some examples to better illustrate the bounding technique presented above for the unconstrained symbols (Equations (170)-(172)), and its shortcomings.

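Example 5 (A Gaussian model). Suppose that the model for $\mathbf{x}$ given $\mathbf{y}$ is an i.i.d. Gaussian model, where $x_i$ is Gaussian with mean $\alpha \cdot y_i$ and variance $\sigma^2_{x|y}$. There are two parameters $\theta = (\alpha, \sigma^2_{x|y})$, and the distribution is

$$P_\theta(\mathbf{x} \mid \mathbf{y}) = (2\pi\sigma^2_{x|y})^{-n/2} \exp\left( -\frac{\|\mathbf{x} - \alpha \mathbf{y}\|^2}{2\sigma^2_{x|y}} \right) \qquad (175)$$

It is easy to check (e.g. by differentiating $\log P_\theta(\mathbf{x} \mid \mathbf{y})$, see also ) that $\hat{\alpha}_{ML} = \frac{\mathbf{x}^T \mathbf{y}}{\|\mathbf{y}\|^2}$ and $\hat{\sigma}^2_{x|y,ML} = \frac{1}{n} \|\mathbf{x} - \hat{\alpha}_{ML} \mathbf{y}\|^2$; substituting we obtain

$$p_{ML}(\mathbf{x} \mid \mathbf{y}) = (2\pi\hat{\sigma}^2_{x|y})^{-n/2} e^{-n/2} = (2\pi\hat{\sigma}^2_{x|y} e)^{-n/2} \qquad (176)$$

therefore by (170):

$$\Theta^{(ML)}(t) = \left\{ \hat{\theta}_{ML}(\mathbf{x} \mid \mathbf{y}) : p_{ML}(\mathbf{x} \mid \mathbf{y}) \le t \right\} = \left\{ (\alpha, \sigma^2_{x|y}) : (2\pi\sigma^2_{x|y} e)^{-n/2} \le t \right\} \qquad (177)$$

and the maximum of the single letter probability is

$$P_{\max}(\theta) = \max_{x_i, y_i} (2\pi\sigma^2_{x|y})^{-1/2} \exp\left( -\frac{(x_i - \alpha \cdot y_i)^2}{2\sigma^2_{x|y}} \right) = (2\pi\sigma^2_{x|y})^{-1/2} \qquad (178)$$

by (172):

$$g_0(\psi_0) = \max_{\theta \in \Theta^{(ML)}(q_{\max}^{n} \psi_0^{1/\gamma})} P_{\max}(\theta) = \max_{(2\pi\sigma^2_{x|y} e)^{-n/2} \le q_{\max}^{n} \psi_0^{1/\gamma}} (2\pi\sigma^2_{x|y})^{-1/2} = q_{\max} e^{1/2} \cdot (\psi_0)^{\frac{1}{\gamma n}} \qquad (179)$$

and by (174):

$$f_0(\psi_0) = \gamma \cdot \log\left(g_0(\psi_0) \cdot q_{\min}^{-1}\right) = \gamma \cdot \log\left( \frac{q_{\max} e^{1/2}}{q_{\min}} \right) + \frac{1}{n} \cdot \log(\psi_0) \qquad (180)$$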
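The Gaussian computation can be checked numerically. The sketch below, under the stated i.i.d. Gaussian model, computes the maximum likelihood estimates, confirms that the maximized likelihood equals $(2\pi\hat{\sigma}^2 e)^{-n/2}$, and derives the largest single-symbol density $(2\pi\sigma^2)^{-1/2}$ compatible with a given value of $p_{ML}$, which is the mechanism behind (179). The function names and the sample data are illustrative only.

```python
import math


def gaussian_ml(x, y):
    """ML estimates (alpha, variance) for the model x_i ~ N(alpha * y_i, variance)."""
    n = len(x)
    alpha = sum(xi * yi for xi, yi in zip(x, y)) / sum(yi * yi for yi in y)
    var = sum((xi - alpha * yi) ** 2 for xi, yi in zip(x, y)) / n
    return alpha, var


def gaussian_log_likelihood(x, y, alpha, var):
    """log P_theta(x|y) in nats for theta = (alpha, var)."""
    n = len(x)
    rss = sum((xi - alpha * yi) ** 2 for xi, yi in zip(x, y))
    return -0.5 * n * math.log(2.0 * math.pi * var) - rss / (2.0 * var)


def single_symbol_bound(log_p_ml, n):
    """Peak density (2*pi*var)^(-1/2) implied by p_ML = (2*pi*var*e)^(-n/2)."""
    return math.exp(log_p_ml / n + 0.5)


if __name__ == "__main__":
    x = [0.9, 2.1, 2.9, 4.2]
    y = [1.0, 2.0, 3.0, 4.0]
    alpha, var = gaussian_ml(x, y)
    log_p_ml = gaussian_log_likelihood(x, y, alpha, var)
    print(alpha, var, log_p_ml, single_symbol_bound(log_p_ml, len(x)))
```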

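Example 6. As another example we consider the case when $\mathbf{x}$ given $\mathbf{y}$ is modeled as i.i.d., where each symbol $x_i$ is conditionally distributed around $y_i$ with a scale factor proportional to $\theta$:

$$P_\theta(x_i \mid y_i) = \theta \cdot f(\theta \cdot (x_i - y_i)) \qquad (181)$$

where

$$f(t) = c \cdot e^{-|t|^p} \qquad (182)$$

$p \ge 1$ is a fixed parameter, and $c$ takes care of normalization so that $\int f(t) \, dt = 1$. This family includes as special cases the symmetric exponential distribution ($p = 1$) and the Gaussian distribution ($p = 2$). We have $P_\theta(\mathbf{x} \mid \mathbf{y}) = \theta^n c^n e^{-\theta^p \sum_i |x_i - y_i|^p}$; it is easy to check that $\hat{\theta}_{ML} = \left( \frac{p}{n} \sum_i |x_i - y_i|^p \right)^{-1/p}$, and therefore $p_{ML} = \hat{\theta}_{ML}^{n} c^n e^{-n/p}$. As before, $p_{ML}$ and $\hat{\theta}_{ML}$ are related, and we have: $\Theta^{(ML)}(t) = \{ \hat{\theta}_{ML}(\mathbf{x} \mid \mathbf{y}) : p_{ML}(\mathbf{x} \mid \mathbf{y}) \le t \} = \{ \theta : \theta^n c^n e^{-n/p} \le t \} = (-\infty, t^{1/n} e^{1/p} c^{-1}]$. In this case $P_{\max}(\theta) = \theta c$, and we have from (172):

$$g_0(\psi_0) = \max_{\theta \in \Theta^{(ML)}(q_{\max}^{n} \psi_0^{1/\gamma})} P_{\max}(\theta) = \max_{\theta \le q_{\max} \psi_0^{1/(\gamma n)} e^{1/p} c^{-1}} \theta c = q_{\max} e^{1/p} \cdot (\psi_0)^{\frac{1}{\gamma n}} \qquad (183)$$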
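The closed forms used in Example 6 can be verified numerically. The following sketch assumes the scale family $P_\theta(x_i|y_i) = \theta c\, e^{-|\theta (x_i - y_i)|^p}$ with $c = p / (2\Gamma(1/p))$ (the standard normalization of $e^{-|t|^p}$), and checks that the stated estimator maximizes the likelihood and that the maximized log-likelihood is $n \log(\hat{\theta} c) - n/p$. Helper names and the sample differences are hypothetical.

```python
import math


def norm_const(p):
    """c such that c * exp(-|t|^p) integrates to 1 over the real line."""
    return p / (2.0 * math.gamma(1.0 / p))


def scale_ml(d, p):
    """ML scale ((p/n) * sum |d_i|^p)^(-1/p) for differences d_i = x_i - y_i."""
    n = len(d)
    return (p / n * sum(abs(v) ** p for v in d)) ** (-1.0 / p)


def scale_log_likelihood(d, p, theta):
    """log P_theta(x|y) in nats for the scale family of Example 6."""
    n = len(d)
    return n * math.log(theta * norm_const(p)) - theta ** p * sum(abs(v) ** p for v in d)


if __name__ == "__main__":
    d = [0.3, -1.2, 0.7, 2.0, -0.4]
    for p in (1.0, 2.0):
        theta = scale_ml(d, p)
        print(p, theta, scale_log_likelihood(d, p, theta))
```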

Example 7 (A general counter example). A rather general case where the summability condition does not hold is when on one hand the probability $P_\theta(x_i \mid \mathbf{y}, \mathbf{x}^{i-1})$ is not globally bounded, and on the other hand, the parameters $\theta$ contain a separate set of parameters for each value of $y_i$. In this case, if the value of $y_i$ on any symbol is unique and does not appear elsewhere, then the probability assigned to this symbol may grow indefinitely, while, with a suitable choice of the other symbols, the overall maximum-likelihood probability $p_{ML}(\mathbf{x} \mid \mathbf{y})$ may remain bounded.

Example 8 (The discrete case). Consider the discrete memoryless case where $\theta(x \mid y)$ is the conditional probability of symbol $x$ to appear when $y$ appears. In this case, the maximum likelihood estimator is the empirical distribution $\hat{\theta}_{ML}(x \mid y) = \hat{P}_{x|y}(x \mid y)$, and the maximum likelihood probability is the empirical probability $p_{ML}(\mathbf{x} \mid \mathbf{y}) = \hat{p}(\mathbf{x} \mid \mathbf{y}) = \exp(-n \hat{H}(\mathbf{x} \mid \mathbf{y}))$ (see (66)). $P_{\max}(\theta)$ in this case is simply $\max_{x,y} \theta(x \mid y)$. The empirical entropy is related to the empirical probability; however, there is an unknown factor, which is the empirical distribution of $\mathbf{y}$. Since we are looking for a bound on $\theta(x \mid y)$ in terms of $p_{ML}(\mathbf{x} \mid \mathbf{y})$ which holds for any $\mathbf{x}, \mathbf{y}$, using the techniques of the previous section we cannot do better than simply bound the probability by 1, i.e. $g_0(\psi_0) = 1$ (see (172)). This is because for a pair of random variables $X, Y$ it is possible to have a large conditional probability $\Pr(X \mid Y)$ with a small effect on the conditional entropy $H(X \mid Y)$ if $\Pr(Y)$ is small (tends to 0). The actual implication is that if the value of $y_i$ on an "unconstrained" symbol is unique (does not appear on the constrained symbols), the empirical probability of this symbol may be 1, while the empirical probability of the rest of the sequence may vary arbitrarily.

Example 9 (Another counter example). The counter example we gave above requires that $\theta$ contains a different set of parameters for each $y$. However, we can show that much less is necessary in order to have an unlimited loss $g_0(\psi_0)$, and this may occur even for the simple case of a memoryless distribution with a single scale parameter. We argued in the previous section that the maximum likelihood solution tends to equalize the probabilities assigned to various symbols. The following example is based on creating a region in which the distribution decays rapidly to 0. By letting one of the points reside in this region, the maximum likelihood solution gives a large part of the probability to this point.
We consider the same setting of Example 6 except the distribution $f$ is:

$$\tilde{f}(t) = \frac{c}{t^2} \cdot e^{-|t|^{-p}} \qquad (184)$$

Note that $\tilde{f}(t)$ is the probability density function of $1/Z$ where $Z$ is distributed according to the density $f(t)$ defined in Example 6, so we have just changed variables. Note also that $\tilde{f}(t)$ is upper bounded and therefore $P_\theta(x_i \mid y_i)$ is bounded for each value of $\theta$. $\tilde{f}(t)$ decays exponentially to 0 for $t \to 0$ (due to the exponential term). We have

$$P_\theta(\mathbf{x} \mid \mathbf{y}) = \prod_{i=1}^{n} \theta \cdot \frac{c}{(\theta (x_i - y_i))^2} \cdot e^{-|\theta (x_i - y_i)|^{-p}} = \frac{c^n \theta^{-n}}{\prod_{i=1}^{n} (x_i - y_i)^2} \cdot e^{-\theta^{-p} \sum_{i=1}^{n} |x_i - y_i|^{-p}} \qquad (185)$$

It is easy to check that $\hat{\theta}_{ML} = \left( \frac{p}{n} \sum_{i=1}^{n} |x_i - y_i|^{-p} \right)^{1/p}$; however, due to the term $\prod_{i=1}^{n} (x_i - y_i)^2$, $p_{ML}$ cannot be expressed via $\hat{\theta}_{ML}$ alone, and $\hat{\theta}_{ML}$ cannot be bounded given $p_{ML}$. We will now show a choice of $\mathbf{x}, \mathbf{y}$ for which the probability density of a single symbol $i = 1$, $P_\theta(x_1 \mid y_1)$, tends to $\infty$ while the overall probability $p_{ML}$ tends to 0. Let $x_1 - y_1 = \delta$, and $x_i - y_i \to \infty$ for $i \ge 2$; then $\hat{\theta}_{ML} \to \left( \frac{p}{n} \right)^{1/p} \frac{1}{\delta}$, and

$$p_{ML}(\mathbf{x} \mid \mathbf{y}) = P_{\hat{\theta}_{ML}}(\mathbf{x} \mid \mathbf{y}) \to \text{const} \cdot \delta^n \cdot 0 \cdot \delta^{-2} = 0 \qquad (186)$$

while

$$P_{\hat{\theta}_{ML}}(x_1 \mid y_1) = \hat{\theta}_{ML} \cdot \frac{c}{(\hat{\theta}_{ML} \delta)^2} \cdot e^{-|\hat{\theta}_{ML} \cdot \delta|^{-p}} \to \text{const} \cdot \frac{1}{\delta} \cdot 1 \cdot e^{-n/p} = \text{const} \cdot \frac{1}{\delta} \qquad (187)$$

By taking $\delta \to 0$ we obtain $P_{\hat{\theta}_{ML}}(x_1 \mid y_1) \to \infty$. This demonstrates a distribution which is controlled by a simple scale parameter, where the summability condition does not hold.
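The behavior claimed in Example 9 can be reproduced numerically. The sketch below assumes the density $\tilde{f}(t) = \frac{c}{t^2} e^{-|t|^{-p}}$ with $c = p/(2\Gamma(1/p))$, places one difference at $\delta$ and the remaining $n-1$ differences at a large value standing in for "$\to \infty$", and reports the density assigned to the first symbol together with $\log p_{ML}$. The specific values of $n$, $p$ and the "far" distance are arbitrary choices for illustration.

```python
import math


def norm_const(p):
    """c such that c * exp(-|t|^p) integrates to 1 over the real line."""
    return p / (2.0 * math.gamma(1.0 / p))


def inverse_scale_ml(d, p):
    """ML scale ((p/n) * sum |d_i|^(-p))^(1/p) for the density of Example 9."""
    n = len(d)
    return (p / n * sum(abs(v) ** (-p) for v in d)) ** (1.0 / p)


def log_f_tilde(t, p):
    """log of f~(t) = c / t^2 * exp(-|t|^(-p)), for t != 0."""
    return math.log(norm_const(p)) - 2.0 * math.log(abs(t)) - abs(t) ** (-p)


def example9(delta, n=4, p=2.0, far=1e6):
    """Return (density of symbol 1 under the ML scale, log p_ML)."""
    d = [delta] + [far] * (n - 1)
    theta = inverse_scale_ml(d, p)
    log_p_ml = sum(math.log(theta) + log_f_tilde(theta * v, p) for v in d)
    first = math.exp(math.log(theta) + log_f_tilde(theta * delta, p))
    return first, log_p_ml


if __name__ == "__main__":
    for delta in (1e-1, 1e-2, 1e-3):
        print(delta, *example9(delta))
```

In this setup the first-symbol density scales like $1/\delta$ while $\log p_{ML}$ stays very negative, in line with (186) and (187).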
G. An infinite horizon adaptive scheme

The scheme of Section VII-A and Theorem 7 is a finite-horizon scheme, i.e. the rate is measured at time $n$ and the scheme is aware of the value of $n$ and is designed to meet the promise of the theorem at this point. It is of interest to consider schemes that do not have this limitation, i.e. they are designed without knowing $n$ and still yield guarantees similar to those of Theorem 7 for any $n$, and specifically, the convergence of the actual rate to the asymptotic rate function given by (115).

A straightforward modification of the scheme presented here to the infinite horizon case is difficult due to the inherent need to design the information contents of a single block, $K$, to keep the overheads small. As can be seen in Corollary 7.2, there is a balance between the overheads incurred at each block and the loss of the last block. One could change $K$ from block to block (e.g. according to the block index, the elapsed time $t$ or the value of the metric $\psi_0$), but an inherent difficulty occurs because the overhead term related to keeping the error probability small increases with time. If we have set a certain value of $K$ for the current block, and the block extends indefinitely (due to a very low value of $\psi$ or equivalently $R_{emp}$), then at some point the overhead for keeping the error probability low would become significant with respect to $K$. A possible solution is to stop the transmission in such a case and restart it with a larger value of $K$, but this complicates the scheme and its analysis.

We present here a simple, brute force, modification of the scheme to the infinite horizon case by an extension termed "the doubling trick", which is also used in universal prediction [10, Section 2.3] to solve a similar problem of matching the scheme parameters to the block length. This scheme is certainly not the most efficient way to achieve the infinite horizon property, and is given here only in order to show that it is feasible to do so. To simplify, the result is particularized to the case where both $R_{emp}$ and $f_0$ are upper bounded by constants, and $L_n$ is subexponential in $n$ (these assumptions are correct for the cases ). The idea is to operate the scheme over epochs in time $n_i$ with increasing lengths. In each epoch, we design the scheme parameters to be optimal for the end of the epoch. If the observation time $n$ occurs before the end of the epoch, the parameters are slightly suboptimal but the loss is small. In the simplest form, each epoch is double the size of the previous one, hence the name "doubling trick".

The first step is to examine the loss incurred when the scheme's parameters are designed for time $h$ (where $h$ is the horizon for which the scheme is designed), while the actual performance is measured at time $n \le h$. Considering again the proof of Theorem 7, we now make a distinction between the value of $n$ used for selecting the scheme's parameters (which is now termed $h$) and the value of $n$ which is the observation time, i.e. the time when the actual rate is measured and compared against the empirical rate function. It is easy to see by following the proof that if the scheme is designed to yield an error of no more than $\epsilon$ up to any time $n \le h$, then only the determination of the thresholds $\psi^*$ changes, and the rest of the analysis remains the same. The result is that if the scheme is not aware of $n$ and is just given a horizon $h \ge n$, then the results of the theorem still hold with $c_n$ replaced by $c_h$ in (116). The next step is to choose $K$. Considering the proof of Corollary 7.2, the value $k_n$ in (131) is now replaced with $c_h + b_1 \cdot f_0$; however, because we assume that $f_0$ is upper bounded by a constant, $f_0 \le f_0^*$, this simply becomes a function of $h$, $k_h = c_h + b_1 \cdot f_0^*$, and by substituting in (131), we would have a redundancy of $\delta = \frac{k_h R_{\max}}{K} + \frac{K}{n}$. Note that the second term is still a function of $n$ since the loss of $K$ bits of the last block is divided by the duration $n$ of the observation time. Choosing $K = \left\lceil \sqrt{h k_h R_{\max}} \right\rceil$ (optimized for $n \approx h$), we have

$$\delta \le \sqrt{\frac{k_h R_{\max}}{h}} + \frac{\sqrt{h k_h R_{\max}} + 1}{n} \le \frac{1}{n} \left( 2 \sqrt{h k_h R_{\max}} + 1 \right) \qquad (188)$$
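The trade-off behind the choice of $K$ can be sketched numerically. The code below assumes the redundancy has the form $\delta = k_h R_{\max}/K + K/n$ (a per-block overhead term plus the loss of the last block) and that $K = \lceil \sqrt{h\, k_h R_{\max}} \rceil$; it then checks the bound $\delta \le (2\sqrt{h\, k_h R_{\max}} + 1)/n$ for observation times $n \le h$. The numeric values of $h$, $k_h$ and $R_{\max}$ are placeholders, not values from the paper.

```python
import math


def block_size(h, k_h, r_max):
    """K = ceil(sqrt(h * k_h * R_max)): block content tuned for horizon h."""
    return math.ceil(math.sqrt(h * k_h * r_max))


def redundancy(n, K, k_h, r_max):
    """Assumed form delta = k_h * R_max / K + K / n at observation time n."""
    return k_h * r_max / K + K / n


def redundancy_bound(n, h, k_h, r_max):
    """(2 * sqrt(h * k_h * R_max) + 1) / n, intended for n <= h."""
    return (2.0 * math.sqrt(h * k_h * r_max) + 1.0) / n


if __name__ == "__main__":
    h, k_h, r_max = 4096, 20.0, 1.0
    K = block_size(h, k_h, r_max)
    for n in (512, 2048, 4096):
        print(n, K, redundancy(n, K, k_h, r_max), redundancy_bound(n, h, k_h, r_max))
```

The penalty for designing with $h$ rather than the true $n$ shows up only through the factor $\sqrt{h}/n$, which is why keeping $h$ within a constant factor of $n$ (as the doubling of epochs does) preserves the order of the redundancy.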

We select the sequence of epoch lengths to be the powers of 2, $h_i = 2^i$, $i = 1, 2, \ldots$. Denote by $N_i$ the end time of the $i$-th epoch, i.e. $N_i = \sum_{l=1}^{i} h_l = 2^{i+1} - 1$. We distinguish between the epochs themselves, which do not depend on $n$, and the "observed epoch", which is the part of the epoch that is included in the period of time $1, \ldots, n$ which we observe (and is an empty set for all epochs after time $n$). We denote by $j$ the index of the epoch that contains time $n$, i.e. $N_{j-1} < n \le N_j$. We denote by $n_i$ the length of the observed epoch, i.e. $n_i = h_i$ for all epochs except the one containing symbol $n$, and $n_j = n - N_{j-1}$ for this epoch. We denote by $\tilde{N}_i = \min(N_i, n)$ the end of each observed epoch. In each epoch we design the scheme for a different error probability $\epsilon_i$, where the sequence of error probabilities satisfies $\sum_{i=1}^{\infty} \epsilon_i \le \epsilon$. This guarantees an error probability of at most $\epsilon$ no matter what the observation time is. Specifically we choose $\epsilon_i = \frac{\epsilon}{2 i^2}$ (since $\sum_{i=1}^{\infty} \frac{1}{i^2} = 1 + \sum_{n=2}^{\infty} \frac{1}{n^2} \le 1 + \sum_{n=2}^{\infty} \frac{1}{n(n-1)} = 1 + \sum_{n=2}^{\infty} \left[ \frac{1}{n-1} - \frac{1}{n} \right] = 1 + \frac{1}{2-1} = 2$).
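The error budget across epochs can be checked directly. The sketch below assumes the allocation $\epsilon_i = \epsilon / (2 i^2)$ and verifies that the accumulated error over any number of epochs stays below $\epsilon$, which is what makes the guarantee independent of the observation time. Function names are illustrative.

```python
def epoch_error_budget(eps, i):
    """Error probability eps / (2 * i^2) allotted to epoch i (i = 1, 2, ...)."""
    return eps / (2.0 * i * i)


def total_error(eps, epochs):
    """Union-bound total over the first `epochs` epochs."""
    return sum(epoch_error_budget(eps, i) for i in range(1, epochs + 1))


if __name__ == "__main__":
    eps = 1e-3
    for epochs in (1, 10, 1000):
        print(epochs, total_error(eps, epochs))
```

Since $\sum_i 1/i^2 = \pi^2/6 < 2$, this allocation uses only about 82% of the budget; the slack is the price of the simple telescoping bound.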

The scheme operated at each epoch uses the metric $\psi(\mathbf{x}^i, \mathbf{y}^i, j)$ to decode the blocks. This metric uses the entire history from time 1, and therefore the scheme operation in each epoch depends on the values of $\mathbf{x}$ and $\mathbf{y}$ in previous epochs. We assume that the conditions of Theorem 7 hold for any epoch with any length, and specifically that the summability condition holds not only for periods of time starting at 1 (in which case $\psi_0$ in the condition is replaced with $\psi(\mathbf{x}^{\tilde{N}_i}, \mathbf{y}^{\tilde{N}_i}, N_{i-1})$, for the observed epoch $[N_{i-1} + 1, \tilde{N}_i]$). It is straightforward to modify the proof of Theorem 7 to see that the rate function $R_{emp,i} = \frac{1}{n_i} \log \psi(\mathbf{x}^{\tilde{N}_i}, \mathbf{y}^{\tilde{N}_i}, N_{i-1})$ is obtained. From the derivation above (188) we have that with our choice of $K$, it is obtained up to $\delta_i = \frac{1}{n_i} \left( 2 \sqrt{h_i k_{h_i} R_{\max}} + 1 \right)$, where $k_{h_i} = c_{h_i} + b_1 \cdot f_0^*$ is evaluated for the error probability $\epsilon_i$; in other words the actual rate over the $i$-th observed epoch satisfies $R_{act,i} \ge R_{emp,i} - \delta_i$. Since the number of bits transmitted in the $i$-th epoch is $n_i R_{act,i}$, the total number of bits $k$ transmitted up to time $n$ satisfies:

$$k = \sum_{i=1}^{j} n_i R_{act,i} \ge \sum_{i=1}^{j} n_i (R_{emp,i} - \delta_i) = \sum_{i=1}^{j} \log \psi(\mathbf{x}^{\tilde{N}_i}, \mathbf{y}^{\tilde{N}_i}, N_{i-1}) - \underbrace{\sum_{i=1}^{j} n_i \delta_i}_{\equiv n \delta(n)} \ge \log \psi_0 - n \delta(n) \qquad (189)$$

where the last inequality is due to the summability condition (note that here the segments cover the entire period $1, \ldots, n$, therefore $m_0 = 0$). Therefore with $R_{emp} = \frac{1}{n} \log \psi_0$ we have:

$$R_{act} = \frac{k}{n} \ge \frac{1}{n} \log \psi_0 - \delta(n) = R_{emp} - \delta(n) \qquad (190)$$

We now bound $\delta(n)$ to show $\delta(n) \to 0$. By substituting $N_i = 2^{i+1} - 1$ in $N_{j-1} < n$ we have that $n \ge 2^j$. Therefore none of the epochs $1, \ldots, j$ is larger than $n$: $h_i \le h_j = 2^j \le n$.

3 
nS{n) — y riiSi 



<J2hJh,(iog^^^ + b,-f*]R„ 



hi<n,ei>ej -L f / ri ■ L \ 



(191) 



J+hl[^og^^^^+b,-fAR. ^-^ 



2ed ' "' ^uy-max ^_^ 
< log2(n) + 2^/ ( log ]^J^^ >> + 61 . /* j i?^a, . -^=— 



therefore $\delta(n) \to 0$ under the assumption that $L_n$ is sub-exponential (i.e. $\frac{\log L_n}{n} \to 0$).

VIII. Examples 
A. Empirical mutual information 

The empirical mutual information is probably the most intuitively appealing rate function. It was presented in [1], and revisited throughout the current paper. Below we review the main results regarding this rate function and discuss the overhead related to attaining it.

The alphabets X ,y are assumed to be discrete. We have: 



In other words, $\hat{I}$ is of the $R^{ML*}_{emp}$ form, which is upper bounded by the $R^{ML}_{emp}$ form (see Section VI and Section VI-C). By definition, the respective $R^{ML}_{emp}$ form guarantees that this rate function asymptotically equals or exceeds the best reliably achievable rate (with the given prior) over any memoryless channel model (Section VI-B), and since they are equivalent in high probability, $R_{emp} = \hat{I}$ will asymptotically achieve this guarantee as well. In the case of the empirical mutual information it is easy to see that this claim holds, since for every memoryless model $\hat{I}(\mathbf{x}; \mathbf{y})$ will tend to the statistical mutual information $I(X; Y)$, which upper bounds the attainable rate.



In Section VI-E2 Lemma B] we saw that it is essentially, but not strictly speaking, the optimal rate function defined by 



zero-order statistics (asymptotically). 

'by the law of large numbers the empirical probability tends to the letter probability, and the claim follows from the continuity of the mutual information
log n



(this is the dominant term from Theorem 



The redundancy of attaining / is upper bounded by Theorem 10 with z = y. In the non adaptive case, / is achievable up to 

_ (i^i-i)-iyi " 



10 



factor in the overhead becomes (5„ defined in Theorem 8 which is (5„ 

A lower bound on the redundancy for the R^^p form ^ log q,^)' 
Section [VI-B2[ writing 

^NML(x|y) 



assuming e is constant). In the adaptive case, the dominant 
. 2^/^^ (for large n). 
can be obtained via Lemma 4 and the discussion in 



nML 



1 1 P(x|y) (74) 1 Cnml 

— log ^ — ■ — - ^^ — log 

n Q{x.) n Q(x) 



1, PNML(x|y) , 1 



lo. 



0(x) 



lOgCN 



(193) 



The term log c^ml is the minimax regret which in this case is known up to an additive factor to be log Cni 
||T9l Section IX]. By Lemma 4 the first term in the RHS of ( |193| l requires redundancy of at least 6q 

' ' ^ /I VI 1^I'\H rt , 



(\x\-i)-\y\ 

loK 



the redundancy in attaining Rl'^^p it at least <5o + ^ log c^ 



(m-i)-iyi-2 



Therefore 
-^&^. The redundancy of the empirical mutual 



information itself can be bounded based on the method of types and Theorem l2] but this bound is looser 



B. Markov sources and stationary ergodic models 

The empirical mutual information is drawn from the $R^{ML*}_{emp}$ construction with a memoryless model. Therefore it is not able to exploit memory in the channel. In a simple example where $y_i = x_{i-1}$ the empirical mutual information tends to 0 while the capacity of the channel is $\log |\mathcal{X}|$.
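The delay example can be illustrated with a short computation. The sketch below computes the empirical mutual information (in bits) of two sequences and applies it to a pseudo-random binary input with $y_i = x_{i-1}$: the zero-order measure is close to 0, while pairing $x_i$ with $y_{i+1}$ (i.e. allowing the model one future output sample) recovers nearly 1 bit per symbol. The sequence length and the random seed are arbitrary choices.

```python
import math
import random
from collections import Counter


def empirical_mutual_information(x, y):
    """Empirical mutual information, in bits, of two equal-length sequences."""
    n = len(x)
    joint = Counter(zip(x, y))
    px = Counter(x)
    py = Counter(y)
    return sum(
        c / n * math.log2(c * n / (px[a] * py[b])) for (a, b), c in joint.items()
    )


rng = random.Random(0)
n = 20000
x = [rng.randrange(2) for _ in range(n)]
y = [0] + x[:-1]  # channel with a one-symbol delay: y_i = x_{i-1}

same_time = empirical_mutual_information(x, y)         # pairs (x_i, y_i)
one_ahead = empirical_mutual_information(x[:-1], y[1:])  # pairs (x_i, y_{i+1})

if __name__ == "__main__":
    print(same_time, one_ahead)
```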

An immediate extension is to replace the memoryless family of distributions with a Markov model. The simplest model could be one in which $X_i$ is a $k$-th order Markov process (the probability of $X_i$ is given as a function of $X_{i-k}^{i-1}$), and the probability of $Y_i$ is given as a function of $X_i$ and the $k$-th order history $X_{i-k}^{i-1}, Y_{i-k}^{i-1}$. In this case, since the probability of $(X_i, Y_i)$ is given as a function of $X_{i-k}^{i-1}, Y_{i-k}^{i-1}$, the pair $(X_i, Y_i)$ is a $k$-th order Markov process. Unfortunately, $Y_i$ alone is not a Markov process but a hidden Markov process (HMM), which has a more complex structure. As a result, the conditional distribution $P_\theta(\mathbf{X}^n \mid \mathbf{Y}^n)$ (where $\theta$ indexes a specific Markov model) does not have a simple closed form expression. Even values such as the size of the conditional Markov type or the conditional entropy rate (which would be needed to characterize this rate function via Theorem 6) are related to the entropy rate of HMMs, which does not have a closed form expression (see for
example fSSl). 

To circumvent this problem we use a more general characterization, suitable for stationary ergodic channels. Since $R^{ML}_{emp}$ is based on modeling $P_\theta(\mathbf{x}^n \mid \mathbf{y}^n)$, we associate the parameters with the conditional distribution, by giving the probability of $x_i$ given the $D$ past letters of the input $\mathbf{x}_{i-D}^{i-1}$ and the past and future of the output $\mathbf{y}_{i-D}^{i+D}$. I.e.

$$P_\theta(\mathbf{x}^n \mid \mathbf{y}^n) = \prod_{i=1}^{n} \theta(x_i \mid \mathbf{x}_{i-D}^{i-1}, \mathbf{y}_{i-D}^{i+D}) \qquad (194)$$

where $\theta(\cdot \mid \cdot) : \mathcal{X}^{D+1} \times \mathcal{Y}^{2D+1} \to [0, 1]$ is a set of conditional probability functions which is the parametric space. Regarding times $i \le D$ in which the past $D$ samples are not defined, we may either define an arbitrary initial state, define a special value (which effectively increases the $\mathcal{Y}$ alphabet size by one, and is equivalent to defining special probability functions for these times), or avoid communication during these times (treat them as a training sequence). To simplify the discussion below we adopt the first solution, although it is easy to modify it.

The probability $P_\theta(\mathbf{x}^n \mid \mathbf{y}^n)$ is $D$-causal (Definition 9). Defining the state variable $z_i = (\mathbf{x}_{i-D}^{i-1}, \mathbf{y}_{i-D}^{i+D})$, this distribution falls into the category of conditionally memoryless distributions. Hence, the maximum likelihood distribution equals the empirical distribution (and similarly for the entropies, see Section VI-A). From the same model class we may extract a $D$-order Markov characterization of the probability of $\mathbf{x}$, therefore it makes sense to choose $Q$ as any $D$-order Markov distribution (note that this is only required for the inequality $R^{ML*}_{emp} \le R^{ML}_{emp}$, which is needed for proving the achievability of $R^{ML*}_{emp}$). Thus in this case we have the following information measures:



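$$R^{ML}_{emp} = \frac{1}{n} \log \frac{\hat{p}(\mathbf{x} \mid \mathbf{z})}{Q(\mathbf{x})} = \frac{1}{n} \log \frac{\hat{p}\left(\mathbf{x} \mid (\mathbf{x}_{i-D}^{i-1}, \mathbf{y}_{i-D}^{i+D})_{i=1}^{n}\right)}{Q(\mathbf{x})} = \hat{H}_Q(\mathbf{x}) - \hat{H}(\mathbf{x} \mid \mathbf{z}) \qquad (195)$$

To write $R^{ML*}_{emp}$ we split the state vector into $z_i = (\mathbf{x}_{i-D}^{i-1}, \mathbf{y}_{i-D}^{i+D}) \equiv (z_{x,i}, z_{y,i})$ and write:

$$R^{ML*}_{emp} = \frac{1}{n} \log \frac{\hat{p}(\mathbf{x} \mid \mathbf{z})}{\hat{p}(\mathbf{x} \mid \mathbf{z}_x)} = \hat{H}(\mathbf{x} \mid \mathbf{z}_x) - \hat{H}(\mathbf{x} \mid \mathbf{z}_x, \mathbf{z}_y) = \hat{I}(\mathbf{x}; \mathbf{z}_y \mid \mathbf{z}_x) \qquad (196)$$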



These rate functions are adaptively achievable by Theorem 10 The redundancy due to the complexity of the parametric 

1„ ^ (l^l-l)-l^l log" ^ (|Ar|-l).|A'|°.|y|^°+i logn 
2 n 2 



family is 



:r„ 



Theorem 



D, the adap 



10 1, while the redundancy due to adaptation is 5n — O i \ -^^ 

ive rate scheme is able to estimate the conditional probability of a symbol x. 



(this is the dominant term, the full expression appears in 

(see Theorem 8 1. Note that because of the delay 

only after yi+n was received, and 



therefore the last $D$ input symbols of each block are "wasted" (at time $i$ the decoding metric considers only $\mathbf{x}^{i-D}$).
By definition, for any channel that satisfies the model $P_\theta(\mathbf{x}^n \mid \mathbf{y}^n)$, the maximum likelihood rate function yields an average rate which is at least as large as the maximum attainable rate with the given input distribution. By taking $D \to \infty$, this model is able to account for all stationary ergodic channels, i.e. channels in which the joint distribution of the processes $\mathbf{X}, \mathbf{Y}$ is time invariant. Of course, in order that the redundancy still tends to 0, $D$ can be taken to infinity only at a logarithmic rate, e.g.

From another point of view, if the processes X, Y are stationary ergodic, then 

i?r^;(X; Y) = /(X;Z,|Z,) ^ /(X,; Y|+g|X^:^) 

^'°\- _ (197) 

^ /(X,;Y|Xri).~^/(X;Y) 

where the convergence in probability is due to the law of large numbers (convergence of the empirical probability) and 
true for any i > D (therefore we may take i — > oo), and the last relation is due to I{Xi;Y\X.]^^) — H{Xi\X.]^^) — 
H{Xi\X[~^,Y) — > I7(X) - H{X.\~'^\Y) QH Section 4.2]. This shows that when the channel is indeed stationary ergodic, 
the rate function proposed tends to the mutual information rate of the channel, which upper bounds the achievable rate (with 
the given prior). 

C. Channel variation over time 

The stationary ergodic model does not cover all types of memory in the channel. Another type is a channel state that evolves irrespectively of the input (such as in fading channels). Note that in (static) Markov channels, i.e. when the state is just a function of the input, capacity does not improve with feedback. However, if the state does not depend only on the input (but can also evolve randomly), then capacity improves with feedback (since it improves the transmitter's guess as to the state) [24]. While in channels of the first kind we are able to reach the capacity, which is also the feedback capacity, with the "individual channel" model (and the above rate function, with the right prior), in channels of the second type our model, in which the input distribution is determined a-priori, creates an inherent limitation, since the best rate is achieved by modifying the input distribution.

However, if we target the mutual information (rather than the feedback capacity), a suitable rate function can be devised by modifying the model such that the conditional probabilities may slowly change with time. Naturally, the redundancy associated with such a model will not tend to 0 with $n$, but behave like $\frac{d}{2} \cdot \frac{\log T}{T}$, where $d$ is the number of parameters and $T$ measures the coherence time (the typical refresh rate of the conditional distribution, e.g. the length of the fading block in a block fading model). This non-decreasing redundancy reflects the loss in rate from learning the channel in each coherence epoch (in a statistical setting this would be reflected by the difference between the known-channel mutual information $I(X; Y \mid \theta)$ and the unknown-channel mutual information $I(X; Y)$).

It is easy to see the source of this factor in a block fading model. The maximum likelihood probability of $\mathbf{x}$ given $\mathbf{y}$ is a product of the maximum likelihood probabilities of each block. Each of these is related to a legitimate (NML) probability by the normalization factor $c_{NML}(T)$, where for large $T$, $\log c_{NML}(T) \approx \frac{d}{2} \log T$ (see (76) and Section VI-B2); therefore $p_{ML}(\mathbf{x} \mid \mathbf{y})$ is related to a conditional probability function by $c_{NML}(T)^{n/T}$, and this affects the overall redundancy by a term of $\frac{1}{n} \log\left(c_{NML}(T)^{n/T}\right) = \frac{1}{T} \log c_{NML}(T) \approx \frac{d}{2T} \log T$.
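The size of this non-vanishing term is easy to tabulate. The sketch below evaluates the approximation $\frac{d}{2T} \log T$ (in nats per symbol) for a few coherence times; the parameter count $d$ and the values of $T$ are arbitrary illustrations, and the approximation itself is only meaningful for large $T$.

```python
import math


def block_fading_redundancy(d, T):
    """Approximate per-symbol learning cost (d / (2T)) * log(T), in nats."""
    return d * math.log(T) / (2.0 * T)


if __name__ == "__main__":
    for T in (100, 1000, 10000):
        print(T, block_fading_redundancy(d=2, T=T))
```

Note that the result does not depend on the total length $n$: lengthening the transmission does not amortize the cost, only a longer coherence time does.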

D. The modulo additive channel 

Shayevitz and Feder's results f3| for the modulo-additive channel (with X — y) can be interpreted as asymptotic adaptive 
achievability of the rate function 

i?cmp = log|A'|-iJ(y-x) (198) 

where y — x refers to letter by letter modulo subtraction. This rate function is easily outperformed by the empirical mutual 
information when using a uniform i.i.d. input distribution [1, Section TBD], since -ff(x) — > log|A'| while _ff(x|y) = 

Prob. 

H{y — x|y) < H{y — x). On the other hand the redundancy of attaining this rate function (the part relating to the model 
complexity) is smaller due to the smaller number of parameters. This rate function can be identified with the maximum 
likelihood rate function, -R"^p with Q{x.) — \X\^" (uniform) and where the noise sequence y — x is modeled as an i.i.d. 
sequence, i.e. $P_\theta(\mathbf{x}|\mathbf{y}) = \prod_{i=1}^{n} \theta(x_i - y_i)$. The intrinsic redundancy will therefore be bounded by approximately $\frac{|\mathcal{X}|-1}{2} \cdot \frac{\log n}{n}$ (see Section
VI-B2i. The actual redundancy of the adaptive scheme is again dominated by (5„ = O J -2ML \ of Theorem 



However this convergence rate is significantly better than the attained by Shayevi tz and F eder's scheme |!3] Section V.C, Table 




I], which is approximately n i/32j4j. In a straightforward way, as done in Section VIII-B, the rate function can be extended to
$R_{\mathrm{emp}} = \log|\mathcal{X}| - \hat{H}(\mathbf{y} - \mathbf{x}|\mathbf{z})$, where $\mathbf{z}$ denotes the past of the assumed noise sequence, $z_i = (\mathbf{x}_{i-D}^{i-1} - \mathbf{y}_{i-D}^{i-1})$. In Section VIII-E
below we extend this result further by replacing $\hat{H}(\mathbf{y} - \mathbf{x}|\mathbf{z})$ by the normalized conditional compression length $\frac{1}{n} L(\mathbf{x}|\mathbf{y})$
attained by any sequential compression scheme for the sequence $\mathbf{x}$ given $\mathbf{y}$ (and in particular the normalized compression
length attained for the noise sequence $\mathbf{y} - \mathbf{x}$ by any compression scheme).
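The comparison between the rate function (198) and the empirical mutual information can be checked numerically. The following is a minimal sketch, assuming zero-order empirical entropies measured in bits; the function names are illustrative and do not come from the paper. It relies only on the chain $\hat{H}(\mathbf{x}|\mathbf{y}) = \hat{H}(\mathbf{y} - \mathbf{x}|\mathbf{y}) \le \hat{H}(\mathbf{y} - \mathbf{x})$, which gives $\hat{I}(\mathbf{x};\mathbf{y}) \ge \hat{H}(\mathbf{x}) - \hat{H}(\mathbf{y} - \mathbf{x})$ for every pair of sequences.

```python
import math
import random
from collections import Counter


def empirical_entropy(symbols):
    """Zero-order empirical entropy (bits) of a sequence of hashable symbols."""
    n = len(symbols)
    counts = Counter(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def empirical_mutual_information(x, y):
    """Empirical mutual information H(x) + H(y) - H(x, y), in bits."""
    joint = list(zip(x, y))
    return empirical_entropy(x) + empirical_entropy(y) - empirical_entropy(joint)


def noise_sequence(x, y, q):
    """Hypothesized noise z = y - x, letter by letter, modulo q."""
    return [(yi - xi) % q for xi, yi in zip(x, y)]


def modulo_additive_rate(x, y, q):
    """Rate function (198): log|X| - H(y - x), in bits per symbol."""
    return math.log2(q) - empirical_entropy(noise_sequence(x, y, q))


if __name__ == "__main__":
    rng = random.Random(0)
    q, n = 4, 4000
    x = [rng.randrange(q) for _ in range(n)]
    # Illustrative channel (an assumption): noise is present only when the
    # output of the previous symbol was odd, so it is predictable from y.
    y, prev = [], 0
    for xi in x:
        zi = rng.randrange(q) if prev % 2 else 0
        prev = (xi + zi) % q
        y.append(prev)
    print("rate (198):       ", round(modulo_additive_rate(x, y, q), 3))
    print("empirical MI:     ", round(empirical_mutual_information(x, y), 3))
```

The two printed values coincide only when the noise is independent of the output; otherwise the empirical mutual information can be larger, at the cost of more parameters to learn.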

"^This is the convergence rate of f.2{n) according to tiie pai'ameters chosen in Section V.C, with a target to only show convergence. 



E. Rate functions based on compression schemes 

A result generalizing the empirical mutual information and its stationary ergodic extensions (Section VIII-B) for the case of a
uniform input distribution, as well as Shayevitz and Feder's result [3] from Section VIII-D, is the asymptotic attainability of
the following rate function:

$$R_{\mathrm{emp}} = \log|\mathcal{X}| - \frac{1}{n} L(\mathbf{x}|\mathbf{y}) \tag{199}$$

where $L(\mathbf{x}|\mathbf{y})$ is the compression (output) length of the sequence $\mathbf{x}$ when the sequence $\mathbf{y}$ is given as side information. In the
non-adaptive case, this rate function is asymptotically attainable for every uniquely decodable code, while for the adaptive
case we need to assume the compressor is "sequential" (which will be formalized below).

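1) Attainability: In the non-adaptive case this directly stems from Kraft's inequality $\sum_{\mathbf{x}} \exp(-L(\mathbf{x}|\mathbf{y})) \le 1$: we can write
$R_{\mathrm{emp}} = \frac{1}{n} \log \frac{\exp(-L(\mathbf{x}|\mathbf{y}))}{Q(\mathbf{x})}$, where $\exp(-L(\mathbf{x}|\mathbf{y})) = c(\mathbf{y}) f(\mathbf{x}|\mathbf{y})$ for a legitimate conditional probability $f(\mathbf{x}|\mathbf{y})$ and $c(\mathbf{y}) \le 1$. Formally,
using the Markov/Chernoff bound (Section V-B):

$$\mu_Q(R_{\mathrm{emp}}) = \frac{1}{n} \log \mathbb{E}_Q\left[\exp(n R_{\mathrm{emp}}(\mathbf{X}, \mathbf{y}))\right] = \frac{1}{n} \log \sum_{\mathbf{x}} \underbrace{Q(\mathbf{x}) \cdot |\mathcal{X}|^{n}}_{=1} \exp(-L(\mathbf{x}|\mathbf{y})) \le 0 \tag{200}$$

Another way to prove the same result is by using the fact that there are at most $\exp(T)$ sequences with $L(\mathbf{x}|\mathbf{y}) \le T$, and
that the total probability of these sequences is therefore at most $\frac{\exp(T)}{|\mathcal{X}|^{n}}$. Hence
$Q(R_{\mathrm{emp}} \ge R) = Q\left(L(\mathbf{x}|\mathbf{y}) \le n(\log|\mathcal{X}| - R)\right) \le \frac{\exp[n(\log|\mathcal{X}| - R)]}{|\mathcal{X}|^{n}} = \exp(-nR)$, and therefore by definition $\mu_Q(R_{\mathrm{emp}}) \le 0$. The fact that we obtained
a lower intrinsic redundancy than the one of Section VIII-D is not surprising, since some of the redundancy is hidden in the
compression length itself.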
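The non-adaptive argument can be illustrated on a toy example. The sketch below is only an illustration under stated assumptions: a binary alphabet, no side information (equivalently, a fixed $\mathbf{y}$), lengths measured in bits, and a Shannon code for an i.i.d. Bernoulli model standing in for the compressor. None of these choices come from the paper. It checks that when the code lengths satisfy Kraft's inequality, the quantity in (200) is non-positive under a uniform $Q$.

```python
import math
from itertools import product


def shannon_code_lengths(n, p_one):
    """Integer lengths ceil(-log2 P(x)) for every binary x of length n,
    where P is an i.i.d. Bernoulli(p_one) model (a stand-in compressor)."""
    lengths = {}
    for x in product((0, 1), repeat=n):
        ones = sum(x)
        prob = (p_one ** ones) * ((1 - p_one) ** (n - ones))
        lengths[x] = math.ceil(-math.log2(prob))
    return lengths


def kraft_sum(lengths):
    """Left-hand side of Kraft's inequality, sum over x of 2^(-L(x))."""
    return sum(2.0 ** -length for length in lengths.values())


def intrinsic_redundancy(lengths, n, alphabet_size=2):
    """(1/n) log2 E_Q[2^(n R_emp)] with uniform Q and
    R_emp = log2|X| - L(x)/n, mirroring equation (200)."""
    q_uniform = float(alphabet_size) ** -n
    expectation = sum(
        q_uniform * 2.0 ** (n * math.log2(alphabet_size) - length)
        for length in lengths.values()
    )
    return math.log2(expectation) / n


if __name__ == "__main__":
    n = 10
    lengths = shannon_code_lengths(n, p_one=0.2)
    print("Kraft sum:", kraft_sum(lengths))
    print("intrinsic redundancy (bits/symbol):", intrinsic_redundancy(lengths, n))
```

With a uniform $Q$ the expectation reduces exactly to the Kraft sum, which is why the redundancy is never positive.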

For the rate-adaptive case additional assumptions are needed. We assume the sequential compression scheme receives $x_i$ and
$y_i$ sequentially (for $i = 1, 2, \ldots$), and occasionally outputs encoded bits representing $\mathbf{x}$. There is an additional input causing
the machine to terminate (i.e. declaring the input pair as the end of the block), in which case it may emit additional bits that
terminate the encoded block. The decoder is required to be able to reconstruct $\mathbf{x}$ (not necessarily sequentially) when $\mathbf{y}$ and
the encoded bits are given.

Define $L_S(\mathbf{x}|\mathbf{y})$ as the unterminated coding length, i.e. the length of the output of the encoder after the input $\mathbf{x}, \mathbf{y}$ has
been fed, but the sequence has not been terminated (i.e. the encoder is expecting additional input), and $L_T(\mathbf{x}|\mathbf{y}) \equiv L(\mathbf{x}|\mathbf{y})$ as
the terminated coding length, i.e. the length of encoding the complete sequence. The sequence $\mathbf{x}$ is uniquely decodable from
the $L_T(\mathbf{x}|\mathbf{y})$ bits of the terminated code, but not necessarily from the $L_S(\mathbf{x}|\mathbf{y})$ bits of the unterminated one. The difference
$L_T(\mathbf{x}|\mathbf{y}) - L_S(\mathbf{x}|\mathbf{y}) \ge 0$ is the information stored in the encoder which has not been output yet. We require that:

1) The difference between the terminated and unterminated lengths is bounded by an asymptotically negligible value:
$$\frac{1}{n}\left(L_T(\mathbf{x}|\mathbf{y}) - L_S(\mathbf{x}|\mathbf{y})\right) \le \frac{1}{n}\Delta_1(n) \xrightarrow[n \to \infty]{} 0$$
This can be considered an embodiment of the limitation to "sequential" encoders and precludes encoders that need to
process the entire sequence in order to produce outputs.

2) The encoding length does not decrease when the sequence is extended: $L_T(\mathbf{x}^{k}|\mathbf{y}^{k}) \ge L_T(\mathbf{x}^{k-1}|\mathbf{y}^{k-1})$.

Consider the system of Section VII-A with the decoding metric $\psi(\mathbf{x}^k, \mathbf{y}^k, j)$ defined by:

$$\log \psi(\mathbf{x}^k, \mathbf{y}^k, j) = (k - j) \cdot \log|\mathcal{X}| - \left(L_T(\mathbf{x}^k|\mathbf{y}^k) - L_T(\mathbf{x}^j|\mathbf{y}^j)\right) \tag{201}$$

I.e. the metric compares the encoding length accumulated from j to k with the encoding length of a random sequence. If this
difference is large, then $\mathbf{x}_{j+1}^{k}$ is assumed to be related to $\mathbf{y}$. We denote $\Delta^*(n) = \max\{\Delta_1(m)\}_{m=1}^{n}$.

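We begin by evaluating the CCDF condition of Theorem 7. In order to bound $\Pr\{\psi(\mathbf{X}^k, \mathbf{y}^k, j) \ge t \mid \mathbf{x}^j\}$ we need to bound
the number of sequences $\mathbf{x}_{j+1}^{k}$ that satisfy this condition for given $\mathbf{y}^k$ and $\mathbf{x}^j$. Suppose that we insert $\mathbf{x}^j, \mathbf{y}^j$ and then further
append them by $\mathbf{x}_{j+1}^{k}, \mathbf{y}_{j+1}^{k}$ and terminate the encoding. Consider the length $L_T(\mathbf{x}^k|\mathbf{y}^k) - L_S(\mathbf{x}^j|\mathbf{y}^j)$. This is the number of
bits emitted by the machine between times j and k, and these bits uniquely encode the sequence $\mathbf{x}_{j+1}^{k}$ (i.e. it is possible to
reconstruct $\mathbf{x}_{j+1}^{k}$ from $\mathbf{x}^j$, $\mathbf{y}$ and this bit sequence). Therefore the number of sequences that are encoded by less than T bits
is at most $\exp(T)$, and therefore their probability (over $Q(\mathbf{x}_{j+1}^{k}|\mathbf{x}^j)$) is at most $\frac{\exp(T)}{|\mathcal{X}|^{k-j}}$. I.e.

$$\Pr_Q\left\{L_T(\mathbf{x}^k|\mathbf{y}^k) - L_S(\mathbf{x}^j|\mathbf{y}^j) \le T \mid \mathbf{x}^j\right\} \le \frac{\exp(T)}{|\mathcal{X}|^{k-j}} \tag{202}$$

Therefore

$$\begin{aligned}
\Pr_Q\left\{\psi(\mathbf{X}^k, \mathbf{y}^k, j) \ge t \mid \mathbf{x}^j\right\}
&= \Pr_Q\left\{L_T(\mathbf{x}^k|\mathbf{y}^k) - L_T(\mathbf{x}^j|\mathbf{y}^j) \le (k - j) \cdot \log|\mathcal{X}| - \log t \mid \mathbf{x}^j\right\} \\
&\overset{\text{Assumption (1): } L_T \le L_S + \Delta}{\le} \Pr_Q\left\{L_T(\mathbf{x}^k|\mathbf{y}^k) - L_S(\mathbf{x}^j|\mathbf{y}^j) \le (k - j) \cdot \log|\mathcal{X}| - \log t + \Delta^*(n) \mid \mathbf{x}^j\right\} \\
&\overset{(202)}{\le} \frac{\exp\left((k - j) \cdot \log|\mathcal{X}| - \log t + \Delta^*(n)\right)}{|\mathcal{X}|^{k-j}} = \frac{\exp(\Delta^*(n))}{t}
\end{aligned} \tag{203}$$

which satisfies the CCDF condition of Theorem 7 with $L_n = \exp(\Delta^*(n))$ (this holds for all m, therefore $b_0 = 0$).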

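The summability condition is satisfied using the assumptions above: given a set of segments $\{j_b, k_b\}_{b=1}^{B}$ as defined in
Theorem 7 with $\sum_{b=1}^{B}(k_b - j_b) = n - m_0$, we extend the sequence by defining $j_{B+1} = n$, and write:

$$\begin{aligned}
\sum_{b=1}^{B} \log \psi_b &= \sum_{b=1}^{B} \left[(k_b - j_b) \cdot \log|\mathcal{X}| - \left(L_T(\mathbf{x}^{k_b}|\mathbf{y}^{k_b}) - L_T(\mathbf{x}^{j_b}|\mathbf{y}^{j_b})\right)\right] \\
&\overset{\text{Assumption (2), } j_{b+1} \ge k_b}{\ge} (n - m_0) \cdot \log|\mathcal{X}| - \sum_{b=1}^{B} \left[L_T(\mathbf{x}^{j_{b+1}}|\mathbf{y}^{j_{b+1}}) - L_T(\mathbf{x}^{j_b}|\mathbf{y}^{j_b})\right] \\
&= (n - m_0) \cdot \log|\mathcal{X}| - \left[L_T(\mathbf{x}^{n}|\mathbf{y}^{n}) - L_T(\mathbf{x}^{j_1}|\mathbf{y}^{j_1})\right] \\
&\ge \left[n \cdot \log|\mathcal{X}| - L_T(\mathbf{x}^{n}|\mathbf{y}^{n})\right] - m_0 \cdot \log|\mathcal{X}| \\
&= \log \psi_0 - m_0 \cdot \log|\mathcal{X}|
\end{aligned} \tag{204}$$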

Therefore the summability condition of Theorem It] is met with /o(V'o) = logl-^l- The values c„,5i of Theorem It] evaluate 
to c„ — log ^^^^ — log j^ + A^(n) and 6i = 6o + 2(iFB ~ 1 = '^d^s — 1- Since our rate function is upper bounded by 



^max = log I'^U aiid /o is Constant, we obtain the following result by substitution in Corollary 7.2 



Theorem 11. Given a sequential source coding scheme with input symbols from alphabet $\mathcal{X}$ that satisfies assumptions (1,2),
and assigns a codeword length $L(\mathbf{x}|\mathbf{y})$ to the sequence $\mathbf{x} \in \mathcal{X}^n$ given $\mathbf{y} \in \mathcal{Y}^n$, then the following rate function is adaptively
achievable

$$R_{\mathrm{emp}} = \log|\mathcal{X}| - \frac{1}{n} L(\mathbf{x}|\mathbf{y}) \tag{205}$$

up to $\delta_n$, where



/log 1^1 ^__n 



3j^^- (log-^-+A^(n) + (2dpB-l)-log|A'|) -^0 (206) 



and $\Delta^*(n) = \max\{\Delta_1(m)\}_{m=1}^{n}$.

Note that the decoding metric (201) in this case is a difference of two values of the form $N_k = k \cdot \log|\mathcal{X}| - L(\mathbf{x}^k|\mathbf{y}^k)$
that can be interpreted as the "incompressibility" of the sequence up to time k (the gap between the compressibility of the
hypothetical noise sequence, and the compressibility of a random sequence). It is interesting to give an interpretation of the
rate adaptive scheme of Section VII-A using $N_k$. Recall that to terminate a block, the decoder compares the decoding metric
against a threshold. Ignoring the overhead terms this threshold is approximately $\exp(K)$ (see $\psi^*$ in Theorem 7), therefore
the termination condition may be interpreted as decoding when the value of $N_k$ increases by K from the start of the current
block. For random sequences (the codewords that were not transmitted), $N_k$ is not expected to increase (the compression
length is approximately $\log|\mathcal{X}|$ per symbol), and K reflects the value of the threshold needed to make sure the probability
of a random sequence appearing to be "compressible" is small. When $N_k$ has increased by K, the termination condition is
satisfied, and we begin a new block. Therefore there is a correspondence between the increase in $N_k$ and the number of blocks
and bits that are transmitted, i.e. the termination condition can be approximately interpreted as $N_k \ge K(b + 1)$ where b is
the number of blocks so far. Therefore, assuming that by time n, B blocks were transmitted, the number of transmitted bits is
$K \cdot B \approx N_n = n \cdot \log|\mathcal{X}| - L(\mathbf{x}^n|\mathbf{y}^n)$. This is depicted in Figure 8, where the horizontal axis is the time k. The solid line
presents $L(\mathbf{x}^k|\mathbf{y}^k)$, and the dashed line $N_k$. The decoding thresholds $K \cdot b$ ($b = 1, 2, \ldots$) are depicted as horizontal lines, while
the vertical lines depict the decoding times. We can see that a decoding occurs whenever $N_k$ crosses a threshold.
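The threshold-crossing interpretation can be simulated. The sketch below makes several assumptions that are not in the paper: the compressor is an idealized Krichevsky-Trofimov sequential estimator applied to a hypothetical noise sequence (so lengths are fractional bits and termination overhead is ignored), and the overhead terms of the threshold are dropped. It tracks $N_k$ and records a decoding time whenever $N_k$ has grown by at least K since the start of the current block.

```python
import math
import random


def kt_code_lengths(z, q):
    """Cumulative ideal code lengths (bits) of the Krichevsky-Trofimov
    sequential estimator over an alphabet of size q.
    lengths[k] is the length assigned to the prefix z[:k]."""
    counts = [0] * q
    total_bits = 0.0
    lengths = [0.0]
    for k, symbol in enumerate(z):
        prob = (counts[symbol] + 0.5) / (k + q / 2)
        total_bits -= math.log2(prob)
        counts[symbol] += 1
        lengths.append(total_bits)
    return lengths


def incompressibility(lengths, q):
    """N_k = k * log2(q) - L(prefix of length k), for k = 0..n."""
    return [k * math.log2(q) - length for k, length in enumerate(lengths)]


def decoding_times(n_values, threshold):
    """Times k at which N_k has increased by at least `threshold`
    relative to its value at the start of the current block."""
    times = []
    block_start = 0
    for k in range(1, len(n_values)):
        if n_values[k] - n_values[block_start] >= threshold:
            times.append(k)
            block_start = k
    return times


if __name__ == "__main__":
    rng = random.Random(1)
    q, n, threshold_bits = 2, 2000, 64.0
    z = [1 if rng.random() < 0.1 else 0 for _ in range(n)]
    n_values = incompressibility(kt_code_lengths(z, q), q)
    times = decoding_times(n_values, threshold_bits)
    print("blocks decoded:", len(times))
    print("bits delivered:", threshold_bits * len(times))
    print("N_n:", round(n_values[-1], 1))
```

Because the per-block increments telescope, the number of delivered bits $K \cdot B$ never exceeds $N_k$ at the last decoding time, which matches the approximate relation $K \cdot B \approx N_n$ described above.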

2) The modulo additive case: A specific case of the rate function proposed here is obtained for the modulo-additive channel
when using a non-conditional source encoder operating over the (hypothesized) noise sequence $\mathbf{z} = \mathbf{y} - \mathbf{x}$, i.e.

$$R_{\mathrm{emp}} = \log|\mathcal{X}| - \frac{1}{n} L(\mathbf{y} - \mathbf{x}) \tag{207}$$

Fig. 8. Illustration of the decoding rule of the rate adaptive system. $L(\mathbf{x}^k|\mathbf{y}^k)$ is the compression length. Decoding thresholds with respect to
$N_k = k \cdot \log|\mathcal{X}| - L(\mathbf{x}^k|\mathbf{y}^k)$ are depicted by horizontal lines.

In this case $\frac{1}{n} L(\mathbf{y} - \mathbf{x})$ can be considered a generalization of the notion of empirical entropy, and therefore generalizes the
rate function (198) presented previously for this channel.

It is specifically interesting to consider an application of the Lempel-Ziv algorithms (LZ77 [25] or LZ78 [26]), since their
compression rate asymptotically reaches the finite state compressibility of the noise sequence $\rho(\mathbf{z})$, which surpasses empirical
entropies of any order. This substitution can be used to prove the universality of the system of Section VII-A attaining (207),
over any finite-block length system operating over the modulo additive channel [27].



We need to show that LZ77 [25] and LZ78 [26] fulfil the assumptions of Theorem 11. Both algorithms operate by creating a
dictionary from previous symbols in the string, compressing a new substring to a tuple containing its location in the dictionary,
plus, possibly, one additional symbol. In LZ77 the dictionary consists of all substrings that begin in a window of specified
length before the first symbol that was not encoded yet. LZ78 parses the string $\mathbf{z}$ into phrases. Each phrase is a substring
which is not a prefix of any previous phrase, but can be generated by concatenating a previous phrase with one additional
symbol. The dictionary contains all phrases.

It is easy to make sure that $L_T$ is monotonic (Assumption (2) of Theorem 11). This depends on the way the last phrase
in the string is treated (and does not affect the asymptotic performance), since this phrase may be an incomplete substring
of a string in the dictionary, and therefore does not naturally terminate and produce a tuple. If, for example, the last phrase
is sent without coding, then $L_T$ will not be monotonic (since adding more symbols to $\mathbf{z}$ that terminate the phrase will
result in a shorter compression). A simple treatment is to encode the last phrase similarly to other phrases: refer to one of
the phrases in the dictionary which is a prefix of the remaining substring, and always give the length of the last substring (or
the length of the block) at the end. This way the compression length associated with the last substring does not decrease when
the substring is extended.

In order to bound $L_T(\mathbf{z}) - L_S(\mathbf{z})$ (Assumption (1)), we need to bound the tuple which encodes the last phrase. In LZ78 this
tuple carries an index to a previous phrase, plus a new symbol. The number of previous phrases is bounded by n (a coarse
bound, but sufficient for our purpose), and therefore [14, Lemma 13.5.1] its encoding will be of length $\log n + \log\log n + 1$, and
the length of the tuple will be $\log n + \log\log n + c$ (where c is a constant accounting also for rounding, encoding of the additional
symbol, etc). Therefore, if we end the block with an indication of its length we have in total $\Delta_{\mathrm{LZ78}}(n) \le 2\log n + 2\log\log n + c$.
In LZ77 this tuple carries a pointer to the window and a length (i.e. two numbers bounded to $\{1, \ldots, n\}$). Therefore, after
adding an indication of the length at the termination, we would have $\Delta_{\mathrm{LZ77}}(n) \le 3\log n + 3\log\log n + c$. In both cases
$\Delta_{\mathrm{LZ}}(n) = O(\log n)$ and the requirement is satisfied.
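The LZ78 incremental parsing and the treatment of the last phrase can be sketched as follows. This is a simplified illustration, not the exact encoder analyzed above: the trailing incomplete phrase is encoded as a pointer to the dictionary phrase it equals (marked here by a `None` symbol), the final length indication is omitted, and the bit accounting is a crude fixed-cost estimate. All function names are illustrative.

```python
import math


def lz78_parse(z):
    """LZ78 incremental parsing of a sequence of hashable symbols.
    Returns tuples (prefix_index, symbol); index 0 is the empty phrase.
    A trailing incomplete phrase is emitted as (index, None)."""
    dictionary = {(): 0}
    tuples = []
    phrase = ()
    for symbol in z:
        candidate = phrase + (symbol,)
        if candidate in dictionary:
            phrase = candidate          # keep extending a known phrase
        else:
            tuples.append((dictionary[phrase], symbol))
            dictionary[candidate] = len(dictionary)
            phrase = ()
    if phrase:
        tuples.append((dictionary[phrase], None))
    return tuples


def lz78_reconstruct(tuples):
    """Inverse of lz78_parse: rebuild the symbol list from the tuples."""
    phrases = [()]
    out = []
    for index, symbol in tuples:
        if symbol is None:
            phrase = phrases[index]     # terminating pointer, no new phrase
        else:
            phrase = phrases[index] + (symbol,)
            phrases.append(phrase)
        out.extend(phrase)
    return out


def lz78_code_length(tuples, alphabet_size):
    """Crude bit count: the c-th tuple pays ceil(log2 c) bits for its index
    (c candidate phrases exist at that point) plus ceil(log2 |X|) bits
    for the symbol."""
    symbol_bits = math.ceil(math.log2(alphabet_size))
    return sum(math.ceil(math.log2(c)) + symbol_bits
               for c in range(1, len(tuples) + 1))


if __name__ == "__main__":
    z = list("abababbbaab")
    parsed = lz78_parse(z)
    print(parsed)
    print("reconstructed ok:", lz78_reconstruct(parsed) == z)
    print("approx. bits:", lz78_code_length(parsed, alphabet_size=2))
```

Since each tuple costs on the order of $\log n$ bits, the gap between the terminated and unterminated lengths is dominated by the final tuple, which is the source of the $O(\log n)$ bound on $\Delta_{\mathrm{LZ}}(n)$.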

3) A converse for the modulo additive case: An interesting thing to note is that all rate functions that depend only on the
noise sequence, $R_{\mathrm{emp}}(\mathbf{x}, \mathbf{y}) = R(\mathbf{z})$ ($\mathbf{z} = \mathbf{y} - \mathbf{x}$), can be written in the form $R(\mathbf{z}) = \log|\mathcal{X}| - \frac{1}{n} L(\mathbf{z})$, where L is a compression
length.

One way to see this is by using the achievability of $R(\mathbf{z})$ to bound the maximum number of sequences with $R(\mathbf{z}) \ge R$,
which then bounds the number of sequences with $L(\mathbf{z}) \le n\log|\mathcal{X}| - nR$, and we can show that Kraft's inequality is met.
Since $R(\mathbf{z})$ can always be written as

$$R(\mathbf{z}) = \log|\mathcal{X}| - \frac{1}{n} L(\mathbf{z}), \tag{208}$$

the purpose is now to prove that for any achievable $R(\mathbf{z})$, $L(\mathbf{z})$ satisfies Kraft's inequality. Rounding issues are ignored as
their effect is at most 1 bit, so $L(\mathbf{z})$ is allowed to be non-integer. The input distribution $Q(\mathbf{x})$ is not limited to be the uniform
distribution. Choose a fixed $\mathbf{y}$ and define the random variable $\mathbf{Z} = \mathbf{X} - \mathbf{y}$. Then, taking any $\gamma < 1$, the necessary condition
of Lemma 1 yields:

$$\mathbb{E}\left[\exp(n\gamma R(\mathbf{Z}))\right] \le \frac{1}{(1 - \epsilon)(1 - \gamma)} \tag{209}$$

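Because the above holds for any $\mathbf{y}$, the same inequality holds for $\mathbf{Y}$ generated randomly and uniformly over $\mathcal{X}^n$. In this case,
irrespective of the distribution of $\mathbf{x}$, $\mathbf{Z}$ becomes uniformly distributed as well. Therefore:

$$\mathbb{E}_{\mathbf{Z} \sim U(\mathcal{X}^n)}\left[\exp(n\gamma R(\mathbf{Z}))\right] = \frac{1}{|\mathcal{X}|^{n}} \sum_{\mathbf{z}} \exp(n\gamma R(\mathbf{z})) \le \frac{1}{(1 - \epsilon)(1 - \gamma)} \tag{210}$$

This can be written as:

$$\sum_{\mathbf{z}} \exp\left(n[\gamma R(\mathbf{z}) - \log|\mathcal{X}|] + \log(1 - \epsilon) + \log(1 - \gamma)\right) \le 1 \tag{211}$$

I.e. the following encoding lengths

$$\begin{aligned}
L'(\mathbf{z}) &= n\log|\mathcal{X}| - n\gamma R(\mathbf{z}) - \log(1 - \epsilon) - \log(1 - \gamma) \\
&\overset{(208)}{=} \gamma L(\mathbf{z}) + n(1 - \gamma)\log|\mathcal{X}| - \log(1 - \epsilon) - \log(1 - \gamma)
\end{aligned} \tag{212}$$

satisfy Kraft's inequality $\sum_{\mathbf{z}} \exp(-L'(\mathbf{z})) \le 1$. Since $\gamma L(\mathbf{z})$ is shorter (better) than $L(\mathbf{z})$, $\gamma$ is chosen to minimize the overhead
terms (the second and fourth terms of (212)). The optimal $\gamma$ is $\gamma = 1 - \frac{1}{n\log|\mathcal{X}|}$, which when substituted above yields:

$$L'(\mathbf{z}) = \gamma L(\mathbf{z}) + \log\left(\frac{e \cdot n\log|\mathcal{X}|}{1 - \epsilon}\right) \le L(\mathbf{z}) + \delta_L. \tag{213}$$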



To make $L'(\mathbf{z})$ feasible encoding lengths one may have to add an overhead of 1 bit. This is summarized in the following
theorem:

Theorem 12. If $R_{\mathrm{emp}}(\mathbf{x}, \mathbf{y}) = \log|\mathcal{X}| - \frac{1}{n} L(\mathbf{x} - \mathbf{y})$ is an achievable rate function (with $\epsilon, Q(\mathbf{x})$), then $\lceil L(\mathbf{z}) + \delta_L \rceil$ are
feasible compression lengths (i.e. satisfy Kraft's inequality), where $\delta_L = \log\left(\frac{e \cdot n\log|\mathcal{X}|}{1 - \epsilon}\right)$.

Note that the overhead $\delta_L$ satisfies $\frac{1}{n}\delta_L \to 0$ and is therefore asymptotically negligible. Combining this with the positive
result of Section VIII-E2 implies that every rate function which is a function of only the noise sequence $\mathbf{z} = \mathbf{y} - \mathbf{x}$ is
asymptotically bounded by the form $\log|\mathcal{X}| - \frac{1}{n} L(\mathbf{z})$ (for some compression lengths $L(\mathbf{z})$).
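The choice of $\gamma$ in (212)-(213) can be verified numerically. The following sketch assumes natural logarithms (matching the use of $\exp$ in the text) and uses illustrative function names; it evaluates the two $\gamma$-dependent overhead terms together with the $\epsilon$ term and compares the minimum with the closed form of $\delta_L$.

```python
import math


def overhead(gamma, n, alphabet_size, eps):
    """Overhead terms of (212), in nats:
    n(1 - gamma) ln|X| - ln(1 - eps) - ln(1 - gamma)."""
    return (n * (1 - gamma) * math.log(alphabet_size)
            - math.log(1 - eps)
            - math.log(1 - gamma))


def optimal_gamma(n, alphabet_size):
    """Minimizer of the overhead: gamma = 1 - 1 / (n ln|X|)."""
    return 1 - 1 / (n * math.log(alphabet_size))


def delta_l(n, alphabet_size, eps):
    """Closed-form overhead delta_L = ln(e * n * ln|X| / (1 - eps))."""
    return math.log(math.e * n * math.log(alphabet_size) / (1 - eps))


if __name__ == "__main__":
    eps, alphabet_size = 0.01, 4
    for n in (100, 1000, 10000):
        g = optimal_gamma(n, alphabet_size)
        print(n, round(g, 6),
              round(overhead(g, n, alphabet_size, eps), 4),
              round(delta_l(n, alphabet_size, eps) / n, 6))
```

The last printed column is the normalized overhead $\delta_L / n$, which shrinks as n grows, in line with the remark that the overhead is asymptotically negligible.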

Another interesting way of proof is to generate a compression scheme from the encoder and decoder: suppose we use the
decoder to decode the message from $\mathbf{y}$, re-encode it to obtain $\mathbf{x}$, and calculate an estimate of the noise $\hat{\mathbf{z}} = \mathbf{y} - \mathbf{x}$. Suppose
we run all combinations of $nR(\mathbf{z})$ bits as inputs to the encoder, then take the output and pass it through the channel with a
specific noise sequence $\mathbf{z}$. Then we obtain $2^{nR(\mathbf{z})}$ different sequences $\mathbf{y}$, a fraction $1 - \epsilon_s$ of which will be mapped by the previous
machine to $\mathbf{z}$ (s denotes the common randomness, and we know that on average $\mathbb{E}_s \epsilon_s \le \epsilon$). If we generate $\mathbf{y}$ at random
(uniformly), the probability of the machine to output $\mathbf{z}$ is at least approximately $\frac{2^{nR(\mathbf{z})}}{|\mathcal{X}|^{n}} = 2^{-n(\log_2|\mathcal{X}| - R(\mathbf{z}))}$. Now to encode, we generate
for each coded sequence, in each length (i.e. '0', '1', '00', '01', ..., which needs an added length indication in order to be a
prefix code), a random choice of a $\mathbf{y}$ sequence, and pass it to the previous machine to generate a $\mathbf{z}$ sequence. The encoding
of a sequence $\mathbf{z}$ is done by taking the first coded sequence which generates $\mathbf{z}$ in the generated codebook. Since we have at
least $2^m$ sequences until exhausting all combinations up to length m, and the probability of each one to produce $\mathbf{z}$ is at least
$2^{-n(\log_2|\mathcal{X}| - R(\mathbf{z}))}$, we can see that this probability will be high if $L(\mathbf{z}) = m$ is slightly larger than $n(\log_2|\mathcal{X}| - R(\mathbf{z}))$. More
accurately, the probability that the length will be higher than m, i.e. that all words up to length m will not produce $\mathbf{z}$, is
$\left(1 - 2^{-n(\log_2|\mathcal{X}| - R(\mathbf{z}))}\right)^{2^m} \approx \exp\left(-2^{m - n(\log_2|\mathcal{X}| - R(\mathbf{z}))}\right)$, which decays very quickly after this point.

4) The conditional Lempel-Ziv: We now consider another interesting substitution in $L(\mathbf{x}|\mathbf{y})$ for the general (non modulo
additive) case, which is the conditional Lempel-Ziv algorithm, described e.g. by Ooi [28, Section 4.3.1]. This algorithm, based
on LZ78 [26], performs Lempel-Ziv incremental parsing of the combined sequence $(x_i, y_i)$. With this parsing each x phrase is
associated with a y phrase. Then for each phrase the algorithm sends the last letter of the phrase, plus the index of the phrase
obtained by removing the last letter, out of all phrases with the same value of y. The assumptions of Theorem 11 are met in
the same way as they are for the non-conditional case (the output phrases are of the same or smaller length).

Note that the metric that results from using the conditional LZ as $L(\mathbf{x}|\mathbf{y})$ is similar to the metric used by Ziv [16] in order
to construct a universal decoder that attains the maximum likelihood error exponent for all finite state channels. Ziv's metric,
which was later termed the conditional LZ complexity [29] (see (337)), refers directly to the number of phrases generated for
each y-phrase, and can be shown to be asymptotically close to $L(\mathbf{x}|\mathbf{y})$. Furthermore, the conditional LZ algorithm was
used by Ooi [?] for constructing a universal communication scheme for finite state channels based on iterative compression.

The results known for the non-conditional LZ, such as Ziv's lemma [14], can be extended to the conditional case [29], and
therefore for every stationary ergodic channel with a stationary ergodic input, the compression rate tends asymptotically (for
$n \to \infty$, almost surely) to the conditional entropy rate, $\frac{1}{n} L(\mathbf{X}|\mathbf{Y}) \to \bar{H}(\mathbf{X}|\mathbf{Y})$ [29, Theorem 2], and hence our rate function
tends to the mutual information.

The probability $p_{\mathrm{LZ}}(\mathbf{x}|\mathbf{y}) = \exp(-L(\mathbf{x}|\mathbf{y}))$ assigned by the conditional LZ to an input sequence asymptotically surpasses
(up to vanishing factors) the probability that can be assigned to the sequence by any finite state machine operating on the
sequences $\mathbf{x}, \mathbf{y}$. Since we have not found an explicit derivation of this result we show this explicitly in [9]. Therefore, considering
the setting of Section VI-F, using this rate function we can compete with the performance of every maximum likelihood decoder
using a finite state characterization of the channel (this is not surprising given Ziv's results [16], and is especially related to his
Lemma 1). Therefore the current result gives us another angle on Ziv's result regarding the finite state channel: while Ziv
considered competing systems operating at the same rate, and showed that the system using the conditional LZ complexity as
a decoding metric achieves the same error exponent universally, here we may compare against systems operating at different
rates (tuned to specific FS channels), and show that the rate adaptive system attains at least the rate obtained by any of these
systems (however we have a suboptimal error exponent).

Another possible candidate for $L(\mathbf{x}|\mathbf{y})$ with similar properties (but possibly a better convergence rate) is the conditional version
of the context tree weighting algorithm [30].

5) Kolmogorov complexity?!: 



F. Second order rate function for the MIMO channel 

In the previous paper [1] we presented the rate function $\frac{1}{2} \log \frac{1}{1 - \hat{\rho}^2}$, where $\hat{\rho}$ is the empirical correlation factor for the real
valued channel $\mathbb{R} \to \mathbb{R}$, and showed it is asymptotically adaptively achievable. In this section we extend this result in several
directions: we consider a MIMO channel with t transmit and r receive antennas, where the components may be real or complex
numbers (i.e. $\mathbb{R}^t \to \mathbb{R}^r$ or $\mathbb{C}^t \to \mathbb{C}^r$), and where the correlation matrix or alternatively the covariance matrix may be used to
define the rate (the difference being in subtracting the mean before taking second moments). The non-adaptive attainability of
the rate function for the real-valued MIMO channel was shown in a conference paper [5] on the subject.

We have altogether four cases (complex/real, covariance/correlation), for which the results and the techniques are very
similar. In order to avoid duplication, we will prove them together (and apologize for the additional complication caused).
For that purpose, we define d as the dimensionality of the input, i.e. 1 for real valued and 2 for complex input, and u as an
indicator whether the mean is subtracted, i.e. u = 0 for correlation matrices, and u = 1 for covariance matrices. The input and
output symbols take values in $\mathbb{B}^t$ and $\mathbb{B}^r$ respectively, where $\mathbb{B} = \mathbb{R}$ for d = 1 and $\mathbb{B} = \mathbb{C}$ for d = 2. For a matrix $A$, $A^*$ denotes the conjugate-transpose
of $A$. We use $\mathbf{1}$ to denote a column vector of 1-s, whose dimension is implicit.

We collect the input vectors over n symbols into the $n \times t$ matrix $\mathbf{X}$ and similarly the $n \times r$ matrix $\mathbf{Y}$ denotes the output.
The rate function is given as a function of $\mathbf{X}, \mathbf{Y}$. We denote sub-matrices similarly to sub-vectors, i.e. $\mathbf{X}_j^k$ denotes the matrix
composed of rows j to k of $\mathbf{X}$.

Although the result here is stronger, the proof in the conference paper [5] is more intuitive than here. Here we use similar
techniques but the proof is more complex due to the need to show adaptive achievability and the other generalizations mentioned,
and some of the intuition may be lost.

1) The Gaussian parametric family and the maximum likelihood distribution: The rate function we present is based on the
maximum likelihood construction (73) relating to the Gaussian i.i.d. family of distributions. In this section we present the
distribution and its associated maximum likelihood probability. The parametric family defining the joint distribution of $\mathbf{x}$ and
$\mathbf{y}$ is the family of Gaussian or complex Gaussian i.i.d. distributions:

$$\Theta = \begin{cases}
\mathcal{N}(\mu_{XY}, \Lambda_{XY})^n, & \mu_{XY} \in \mathbb{R}^{t+r},\ \Lambda_{XY} \in \mathbb{R}^{(t+r) \times (t+r)} & u = 1, d = 1 \\
\mathcal{N}(0, \Lambda_{XY})^n, & \Lambda_{XY} \in \mathbb{R}^{(t+r) \times (t+r)} & u = 0, d = 1 \\
\mathcal{CN}(\mu_{XY}, \Lambda_{XY})^n, & \mu_{XY} \in \mathbb{C}^{t+r},\ \Lambda_{XY} \in \mathbb{C}^{(t+r) \times (t+r)} & u = 1, d = 2 \\
\mathcal{CN}(0, \Lambda_{XY})^n, & \Lambda_{XY} \in \mathbb{C}^{(t+r) \times (t+r)} & u = 0, d = 2
\end{cases} \tag{214}$$

Using the maximum likelihood rate function (73) over this family guarantees attaining the mutual information for every
Gaussian memoryless MIMO channel (where the input and output are jointly Gaussian).

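We would like to find the maximum likelihood probabilities for the families above. We start with the non-conditional case,
i.e. the maximum likelihood probability of a vector (which we denote by $\mathbf{x}$, but it may be a concatenation of $\mathbf{x}, \mathbf{y}$). In the
non-conditional form, each of the n rows of $\mathbf{X}$ is modeled as a Gaussian random vector $\mathcal{N}(\mu_{1 \times t}, \Lambda_{t \times t})$, independent of the
others. The probability density of a single row $\mathbf{x}$ (a row vector) in the real valued case is:

$$P_{\mu,\Lambda}(\mathbf{x}) = |2\pi\Lambda|^{-\frac{1}{2}} e^{-\frac{1}{2}(\mathbf{x} - \mu)\Lambda^{-1}(\mathbf{x} - \mu)^T}, \quad \mathbf{x} \in \mathbb{R}^t \tag{215}$$

In the complex-valued case, we have instead (see the footnote on producing this distribution):

$$P_{\mu,\Lambda}(\mathbf{x}) = |\pi\Lambda|^{-1} e^{-(\mathbf{x} - \mu)\Lambda^{-1}(\mathbf{x} - \mu)^*}, \quad \mathbf{x} \in \mathbb{C}^t \tag{216}$$

where in both cases $\mu = \mathbb{E}\,\mathbf{x}$ and $\Lambda = \mathbb{E}\,(\mathbf{x} - \mu)^*(\mathbf{x} - \mu)$. $\Lambda$ is non-negative definite. (Note that in the complex case, the power
of each component of $\mathbf{x}$ is split between the real and imaginary components.) In general we can write:

$$P_{\mu,\Lambda}(\mathbf{x}) = \left|\tfrac{2}{d}\pi\Lambda\right|^{-\frac{d}{2}} e^{-\frac{d}{2}(\mathbf{x} - \mu)\Lambda^{-1}(\mathbf{x} - \mu)^*}, \quad \mathbf{x} \in \mathbb{B}^t \tag{217}$$

To obtain the rate function based on correlation matrices (u = 0) we will degenerate this family by fixing $\mu = 0$. For brevity,
in the rest of the section, we will use the word "Gaussian" to refer to both Gaussian and complex Gaussian vectors.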

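Considering the $n \times t$ matrix $\mathbf{X} = (\mathbf{x}_1^T, \ldots, \mathbf{x}_n^T)^T$ where the rows are i.i.d. and distributed according to (217), we have the
following distribution for the matrix:

$$\begin{aligned}
P_{\mu,\Lambda}(\mathbf{X}) = \prod_{i=1}^{n} P_{\mu,\Lambda}(\mathbf{x}_i)
&= \left|\tfrac{2}{d}\pi\Lambda\right|^{-\frac{dn}{2}} e^{-\frac{d}{2}\sum_{i=1}^{n}(\mathbf{x}_i - \mu)\Lambda^{-1}(\mathbf{x}_i - \mu)^*} \\
&= \left|\tfrac{2}{d}\pi\Lambda\right|^{-\frac{dn}{2}} e^{-\frac{d}{2}\mathrm{tr}\left((\mathbf{X} - \mathbf{1}\mu)\Lambda^{-1}(\mathbf{X} - \mathbf{1}\mu)^*\right)}
\overset{\mathrm{tr}AB = \mathrm{tr}BA}{=} \left|\tfrac{2}{d}\pi\Lambda\right|^{-\frac{dn}{2}} e^{-\frac{d}{2}\mathrm{tr}\left((\mathbf{X} - \mathbf{1}\mu)^*(\mathbf{X} - \mathbf{1}\mu)\Lambda^{-1}\right)}
\end{aligned} \tag{218}$$

We would now like to find the ML estimate of $\mu$ and $\Lambda$ given $\mathbf{X}$. For u = 0 we fix $\mu = 0$ and optimize (218)
with respect to $\Lambda$. It is intuitively clear that for u = 1, $\hat{\mu}_{\mathrm{ML}}$ is just the empirical mean $\hat{\mu}_{\mathrm{ML}} = \frac{1}{n}\mathbf{1}^T\mathbf{X}$, and that $\hat{\Lambda}_{\mathrm{ML}}$ is the
empirical covariance (u = 1) or correlation matrix (u = 0), $\hat{\Lambda}_{\mathrm{ML}} = \frac{1}{n}(\mathbf{X} - \mathbf{1}\hat{\mu}_{\mathrm{ML}})^*(\mathbf{X} - \mathbf{1}\hat{\mu}_{\mathrm{ML}})$ (where for u = 0 we just
take $\hat{\mu}_{\mathrm{ML}} = 0$).

To prove this, we first maximize (218) with respect to $\mu$, which implies minimizing $\mathrm{tr}\left((\mathbf{X} - \mathbf{1}\mu)^*(\mathbf{X} - \mathbf{1}\mu)\Lambda^{-1}\right)$.
Defining

$$\mathbf{X}_c = \mathbf{X} - \mathbf{1}\hat{\mu}_{\mathrm{ML}} \tag{219}$$

we have that $\mathbf{1}^T\mathbf{X}_c = 0$ and therefore:

$$\begin{aligned}
\mathrm{tr}\left((\mathbf{X} - \mathbf{1}\mu)^*(\mathbf{X} - \mathbf{1}\mu)\Lambda^{-1}\right)
&= \mathrm{tr}\left((\mathbf{X}_c + \mathbf{1}(\hat{\mu}_{\mathrm{ML}} - \mu))^*(\mathbf{X}_c + \mathbf{1}(\hat{\mu}_{\mathrm{ML}} - \mu))\Lambda^{-1}\right) \\
&= \mathrm{tr}\left(\mathbf{X}_c^*\mathbf{X}_c\Lambda^{-1}\right) + \mathrm{tr}\left((\hat{\mu}_{\mathrm{ML}} - \mu)^*\mathbf{1}^T\mathbf{1}(\hat{\mu}_{\mathrm{ML}} - \mu)\Lambda^{-1}\right)
\end{aligned} \tag{220}$$

The second term is non-negative and is minimized for $\mu = \hat{\mu}_{\mathrm{ML}}$.
Substituting $\mu = \hat{\mu}_{\mathrm{ML}}$ in (218) we obtain

$$\max_{\mu} P_{\mu,\Lambda}(\mathbf{X}) = \left|\tfrac{2}{d}\pi\Lambda\right|^{-\frac{dn}{2}} e^{-\frac{d}{2}\mathrm{tr}\left(\mathbf{X}_c^*\mathbf{X}_c\Lambda^{-1}\right)} \tag{221}$$

where $\mathbf{X}_c$ is defined by (219) (and for u = 0 we fix $\hat{\mu}_{\mathrm{ML}} = 0$). It remains to maximize the above with respect to
$\Lambda$. We change the optimization variable by defining $A = \mathbf{X}_c^*\mathbf{X}_c\Lambda^{-1}$. The determinants of the two matrices are related by
$\ln|A| = \ln|\mathbf{X}_c^*\mathbf{X}_c| - \ln|\Lambda| = \mathrm{const} - \ln|\Lambda|$, so taking the logarithm of (221) and removing constants, it remains to maximize:

$$n\ln|A| - \mathrm{tr}A \tag{222}$$

with respect to $A$. By Hadamard's inequality, since $A$ is non-negative definite, $|A| \le \prod_{i=1}^{t} A_{ii}$ (with equality iff $A$ is diagonal),
therefore (222) is upper bounded by $\sum_{i=1}^{t}(n\ln A_{ii} - A_{ii})$, which is maximized for $A_{ii} = n$. The upper bound can be met
by choosing a diagonal $A$, and therefore we have $A = n \cdot I_{t \times t}$. Changing variables we obtain that the ML estimate of $\Lambda$ is the
empirical covariance/correlation:

$$\hat{\Lambda}_{\mathrm{ML}}(\mathbf{X}) = \mathbf{X}_c^*\mathbf{X}_c \cdot A^{-1} = \frac{1}{n}\mathbf{X}_c^*\mathbf{X}_c \tag{223}$$

Substituting the result into the probability density we obtain:

$$P_{\mathrm{ML}}(\mathbf{X}) = P_{\hat{\mu}(\mathbf{X}),\hat{\Lambda}(\mathbf{X})}(\mathbf{X}) = \left|\tfrac{2}{d}\pi\tfrac{1}{n}\mathbf{X}_c^*\mathbf{X}_c\right|^{-\frac{dn}{2}} e^{-\frac{d}{2}\mathrm{tr}\left(\mathbf{X}_c^*\mathbf{X}_c\left(\frac{1}{n}\mathbf{X}_c^*\mathbf{X}_c\right)^{-1}\right)} = \left|\tfrac{2}{d}\pi e\tfrac{1}{n}\mathbf{X}_c^*\mathbf{X}_c\right|^{-\frac{dn}{2}} \tag{224}$$

Note that $P_{\mathrm{ML}}(\mathbf{X})$ diverges when the columns of $\mathbf{X}_c$ are linearly dependent.

Footnote (to the complex-valued density (216)): It is easy to produce this distribution by taking a complex Gaussian vector whose real and imaginary parts are i.i.d. distributed $\mathcal{N}(0, \frac{1}{2})$ and multiplying it by $\Lambda^{\frac{1}{2}}$.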
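For the simplest case (real-valued, d = 1, a single component t = 1) the closed form (224) can be checked directly. The sketch below is only an illustration of that special case with illustrative function names; the flag `subtract_mean` plays the role of u (True for covariance, False for correlation), and logarithms are natural.

```python
import math


def gaussian_log_likelihood(samples, mu, var):
    """Log of the i.i.d. real Gaussian density (218) for t = 1, d = 1."""
    n = len(samples)
    quad = sum((s - mu) ** 2 for s in samples)
    return -0.5 * n * math.log(2 * math.pi * var) - quad / (2 * var)


def ml_estimates(samples, subtract_mean=True):
    """ML mean and variance; with subtract_mean=False (u = 0) the mean is
    fixed to zero and the 'variance' is the empirical second moment."""
    n = len(samples)
    mu = sum(samples) / n if subtract_mean else 0.0
    var = sum((s - mu) ** 2 for s in samples) / n
    return mu, var


def log_p_ml(samples, subtract_mean=True):
    """Closed form of (224) for t = 1, d = 1:
    log P_ML = -(n/2) * log(2 * pi * e * var_hat)."""
    _, var = ml_estimates(samples, subtract_mean)
    return -0.5 * len(samples) * math.log(2 * math.pi * math.e * var)


if __name__ == "__main__":
    data = [0.3, -1.2, 2.5, 0.7, -0.4, 1.9]
    for u in (True, False):
        mu, var = ml_estimates(data, subtract_mean=u)
        print(u, round(gaussian_log_likelihood(data, mu, var), 6),
              round(log_p_ml(data, subtract_mean=u), 6))
```

Both printed values agree for each setting of u, and any other choice of mean or variance gives a smaller likelihood.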

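We now discuss the conditional case. Assume $[\mathbf{x}, \mathbf{y}]$ are jointly Gaussian row vectors of sizes t, r respectively, with means
$[\mu_x, \mu_y]$ and covariances $\Lambda_{xx}, \Lambda_{yy}, \Lambda_{xy}$. Then the conditional distribution is known to be Gaussian as well with:

$$P(\mathbf{x}|\mathbf{y}) = \left|\tfrac{2}{d}\pi\Lambda_{x|y}\right|^{-\frac{d}{2}} e^{-\frac{d}{2}(\mathbf{x} - \mu_{x|y}(\mathbf{y}))\Lambda_{x|y}^{-1}(\mathbf{x} - \mu_{x|y}(\mathbf{y}))^*} \tag{225}$$

where

$$\mu_{x|y}(\mathbf{y}) = \mu_x + (\mathbf{y} - \mu_y)\Lambda_{yy}^{-1}\Lambda_{yx}, \qquad \Lambda_{x|y} = \Lambda_{xx} - \Lambda_{xy}\Lambda_{yy}^{-1}\Lambda_{yx} \tag{226}$$

For our purposes, it will be convenient to define the conditional distribution by a different set of parameters. We write:

$$P_\theta(\mathbf{x}|\mathbf{y}) = \left|\tfrac{2}{d}\pi\Lambda_{x|y}\right|^{-\frac{d}{2}} e^{-\frac{d}{2}(\mathbf{x} - \mathbf{y}A - \mathbf{b})\Lambda_{x|y}^{-1}(\mathbf{x} - \mathbf{y}A - \mathbf{b})^*} \tag{227}$$

where $\theta = [A_{[r \times t]}, \mathbf{b}_{[1 \times t]}, \Lambda_{x|y\,[t \times t]}]$ is the vector of new parameters. $\mathbf{y}A + \mathbf{b}$ is the MMSE estimator $\mathbb{E}[\mathbf{x}|\mathbf{y}]$. For the case
u = 0 we fix $\mathbf{b} = 0$.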

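For matrices $\mathbf{X}, \mathbf{Y}$ whose rows are distributed i.i.d. based on the distribution above, we have:

$$\begin{aligned}
P_\theta(\mathbf{X}|\mathbf{Y}) = \prod_{i=1}^{n} P_\theta(\mathbf{x}_i|\mathbf{y}_i)
&= \left|\tfrac{2}{d}\pi\Lambda_{x|y}\right|^{-\frac{dn}{2}} e^{-\frac{d}{2}\sum_{i=1}^{n}(\mathbf{x}_i - \mathbf{y}_iA - \mathbf{b})\Lambda_{x|y}^{-1}(\mathbf{x}_i - \mathbf{y}_iA - \mathbf{b})^*} \\
&= \left|\tfrac{2}{d}\pi\Lambda_{x|y}\right|^{-\frac{dn}{2}} e^{-\frac{d}{2}\mathrm{tr}\left[(\mathbf{X} - \mathbf{Y}A - \mathbf{1}\mathbf{b})\Lambda_{x|y}^{-1}(\mathbf{X} - \mathbf{Y}A - \mathbf{1}\mathbf{b})^*\right]} \\
&= \left|\tfrac{2}{d}\pi\Lambda_{x|y}\right|^{-\frac{dn}{2}} e^{-\frac{d}{2}\mathrm{tr}\left[(\mathbf{X} - \mathbf{Y}A - \mathbf{1}\mathbf{b})^*(\mathbf{X} - \mathbf{Y}A - \mathbf{1}\mathbf{b})\Lambda_{x|y}^{-1}\right]}
\end{aligned} \tag{228}$$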

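To find the ML estimator, we begin by maximizing with respect to $A, \mathbf{b}$. This is a simple quadratic problem, but the algebra
can be avoided by considering it as an estimation problem. Consider the matrix $\Lambda_e = \frac{1}{n}(\mathbf{X} - \mathbf{Y}A - \mathbf{1}\mathbf{b})^*(\mathbf{X} - \mathbf{Y}A - \mathbf{1}\mathbf{b})$. This
matrix can be considered as the mean estimation error covariance matrix in the following scenario: a linear estimator
$\hat{\mathbf{x}} = \mathbf{y}A + \mathbf{b}$ is sought, and the matrix above is the estimation error covariance matrix when $(\mathbf{x}, \mathbf{y})$ are selected from the
i-th row of $[\mathbf{X}, \mathbf{Y}]$ and $i \sim U\{1, \ldots, n\}$. In other words, one seeks a linear estimator which, given a randomly selected
row in $\mathbf{Y}$, will produce an estimate of the respective row in $\mathbf{X}$. The LMMSE estimator brings the matrix $\Lambda_e$ to minimum (in
the matrix sense) and therefore would bring $P_\theta$ to maximum. In this scenario, the covariances and means of $(\mathbf{x}, \mathbf{y})$ are the
empirical covariances and means (since the rows are selected uniformly). Therefore the optimal linear estimator is

$$\mathbf{y}A + \mathbf{b} = \hat{\mu}_X + (\mathbf{y} - \hat{\mu}_Y)C_{YY}^{-1}C_{YX} \tag{229}$$

where

$$\begin{aligned}
\hat{\mu}_X &= \mathbb{E}_i\,\mathbf{x}_i = \frac{1}{n}\mathbf{1}^T\mathbf{X} \\
\hat{\mu}_Y &= \mathbb{E}_i\,\mathbf{y}_i = \frac{1}{n}\mathbf{1}^T\mathbf{Y} \\
C_{YX} &= \mathbb{E}_i\,(\mathbf{y}_i - \hat{\mu}_Y)^*(\mathbf{x}_i - \hat{\mu}_X) = \frac{1}{n}(\mathbf{Y} - \mathbf{1}\hat{\mu}_Y)^*(\mathbf{X} - \mathbf{1}\hat{\mu}_X) \\
C_{YY} &= \mathbb{E}_i\,(\mathbf{y}_i - \hat{\mu}_Y)^*(\mathbf{y}_i - \hat{\mu}_Y) = \frac{1}{n}(\mathbf{Y} - \mathbf{1}\hat{\mu}_Y)^*(\mathbf{Y} - \mathbf{1}\hat{\mu}_Y)
\end{aligned}$$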

Furthermore, after substituting $A, \mathbf{b}$ from (229) we will obtain in the exponent of (228) the LMMSE error matrix (of the
aforementioned scenario) which is:

$$\Lambda_{\mathrm{LMMSE}} = C_{XX} - C_{YX}^*C_{YY}^{-1}C_{YX} \triangleq C_{X|Y} \tag{230}$$

where $C_{XX}$ is defined similarly to $C_{YY}$. Substituting $\Lambda_e$ in (228) we have:

$$\max_{A,\mathbf{b}} P_\theta(\mathbf{X}|\mathbf{Y}) = \left|\tfrac{2}{d}\pi\Lambda_{x|y}\right|^{-\frac{dn}{2}} e^{-\frac{d}{2}\mathrm{tr}\left[n \cdot C_{X|Y}\Lambda_{x|y}^{-1}\right]} \tag{231}$$

This can also be verified by direct substitution of (229) in (228). In the case of u = 0, where we have $\mathbf{b} = 0$, we are limited to
linear estimators of the form $\hat{\mathbf{x}} = \mathbf{y}A$. The solution in this case is to replace $\hat{\mu}_X, \hat{\mu}_Y$ with zeros, and $C_{\cdot\cdot}$ with the respective
correlation matrices (i.e. obtained without removing the mean). The proof is technical and appears in Appendix E.

We remain with the problem of maximizing with respect to $\Lambda_{x|y}$, which is identical to the non-conditional case (221),
where $\frac{1}{n}\mathbf{X}_c^*\mathbf{X}_c$ is replaced with $C_{X|Y}$. Therefore the maximum in (231) will be attained for $\Lambda_{x|y} = C_{X|Y}$, and the maximum
likelihood distribution is:

$$P_{\mathrm{ML}}(\mathbf{X}|\mathbf{Y}) = \max_\theta P_\theta(\mathbf{X}|\mathbf{Y}) = \left|\tfrac{2}{d}\pi C_{X|Y}\right|^{-\frac{dn}{2}} e^{-\frac{d}{2}\mathrm{tr}\left[n \cdot C_{X|Y}C_{X|Y}^{-1}\right]} = \left|\tfrac{2}{d}\pi e\,C_{X|Y}\right|^{-\frac{dn}{2}} \tag{232}$$

where $C_{X|Y}$ is a function of $\mathbf{X}, \mathbf{Y}$ defined by (230).

Note that if the columns of $\mathbf{Y}$ are linearly dependent, or are linearly dependent on the $\mathbf{1}$ vector (in the case u = 1), the value
of (229) is not defined. In this case, return to (228) and observe that the result of $P_{\mathrm{ML}}$ only depends on the subspace spanned
by the columns of $\mathbf{Y}$ (plus the vector $\mathbf{1}$), since this determines the values that $\mathbf{Y}A + \mathbf{1}\mathbf{b}$ can attain. Therefore, removing
linearly dependent columns from $\mathbf{Y}$ does not change the result (and it does not matter which columns are removed).

We summarize the results of this sub-section in the following Lemma:

Lemma 8. Let the matrix $\mathbf{X}$ be defined by an i.i.d. Gaussian $\mathcal{N}(\mu, \Lambda)$ distribution (d = 1) or a complex Gaussian $\mathcal{CN}(\mu, \Lambda)$
distribution (d = 2) on its rows, as defined in (217). Then the maximum likelihood probability, which is obtained by maximizing
(217) with respect to $\mu, \Lambda$ (in the case u = 1) or with respect to $\Lambda$ for $\mu = 0$ (in the case u = 0) is:

$$P_{\mathrm{ML}}(\mathbf{X}) = \left|\tfrac{2}{d}\pi e\,C_{XX}\right|^{-\frac{dn}{2}} \tag{233}$$

where $C_{XX}$ is defined below. When $\mathbf{X}$ is defined by a conditional i.i.d. distribution on its rows, conditioned on the respective
rows of $\mathbf{Y}$, as defined in (225) or (228), then the maximum likelihood probability, obtained by maximizing (228) with respect to
$\theta = [A_{[r \times t]}, \mathbf{b}_{[1 \times t]}, \Lambda_{x|y\,[t \times t]}]$ (where for u = 0, $\mathbf{b} = 0$ and is excluded from $\theta$), is:

$$P_{\mathrm{ML}}(\mathbf{X}|\mathbf{Y}) = \left|\tfrac{2}{d}\pi e\,C_{X|Y}\right|^{-\frac{dn}{2}} \tag{234}$$

where the covariance matrices are defined as follows (equations (235)-(238)):

$$\begin{aligned}
\hat{\mu}(\mathbf{Z}) &= u \cdot \frac{1}{n}\mathbf{1}\mathbf{1}^T\mathbf{Z} \\
C_{ZW} &= \frac{1}{n}(\mathbf{Z} - \hat{\mu}(\mathbf{Z}))^*(\mathbf{W} - \hat{\mu}(\mathbf{W})) \\
C_{X|Y} &= C_{XX} - C_{YX}^*C_{YY}^{-1}C_{YX}
\end{aligned}$$

where $\mathbf{Z}, \mathbf{W}$ are generic matrices which are replaced with $\mathbf{X}$ or $\mathbf{Y}$ as appropriate. If $C_{YY}$ is singular, the result is obtained
by removing columns of $\mathbf{Y}$ until the columns are linearly independent of each other (and of the $\mathbf{1}$ vector, in case of u = 1).

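2) The maximum likelihood rate function: The input distribution is based on the i.i.d. Gaussian distribution $\mathcal{N}(0, \Lambda_X)^n$ or $\mathcal{CN}(0, \Lambda_X)^n$ (we always use mean zero even if $u = 1$). We define $\tilde{Q}$ as the ideal distribution $\mathcal{N}(0, \Lambda_X)^n$:

$$\tilde{Q}(\mathbf{X}) = \left|\tfrac{2}{d}\pi \Lambda_X\right|^{-\frac{dn}{2}} e^{-\frac{d}{2}\mathrm{tr}\left(\mathbf{X}^*\mathbf{X}\Lambda_X^{-1}\right)} \tag{239}$$

Since $\tilde{Q}(\mathbf{X})$ is not bounded away from zero (for non-degenerate $\mathbf{X}$, taking $a \to \infty$ yields $\tilde{Q}(a\mathbf{X}) \to 0$), the actual input distribution will be a trimmed Gaussian $Q(\mathbf{X})$ which will be defined in the sequel. However the rate function will be defined with respect to the ideal $\tilde{Q}$.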



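As in Section VI-D we can define the rate function by the empirical and quasi-empirical entropies:

$$H_{\tilde{Q}}(\mathbf{X}) = -\frac{1}{n}\log \tilde{Q}(\mathbf{X}) = \frac{d}{2}\log\left|\tfrac{2}{d}\pi \Lambda_X\right| + \frac{d}{2}\cdot\log e\cdot\mathrm{tr}\left(\tfrac{1}{n}\mathbf{X}^*\mathbf{X}\Lambda_X^{-1}\right) = \frac{d}{2}\log\left|\tfrac{2}{d}\pi e \Lambda_X\right| + \frac{d}{2}\cdot\log e\cdot\mathrm{tr}\left(\tfrac{1}{n}\mathbf{X}^*\mathbf{X}\Lambda_X^{-1} - I\right) \tag{240}$$

$$H_{ML}(\mathbf{X}) = -\frac{1}{n}\log P_{ML}(\mathbf{X}) = \frac{d}{2}\cdot\log\left|\tfrac{2}{d}\pi e\, C_{XX}\right| \tag{241}$$

(Note the similarity to the expression for the entropy of a Gaussian random vector.)

$$H_{ML}(\mathbf{X}|\mathbf{Y}) = -\frac{1}{n}\log P_{ML}(\mathbf{X}|\mathbf{Y}) = \frac{d}{2}\cdot\log\left|\tfrac{2}{d}\pi e\, C_{X|Y}\right| \tag{242}$$

and the rate functions:

$$R_{emp}^{ML} \triangleq \frac{1}{n}\log\frac{P_{ML}(\mathbf{X}|\mathbf{Y})}{\tilde{Q}(\mathbf{X})} = H_{\tilde{Q}}(\mathbf{X}) - H_{ML}(\mathbf{X}|\mathbf{Y}) = \frac{d}{2}\log\frac{|\Lambda_X|}{|C_{X|Y}|} + \frac{d}{2}\log e\cdot\mathrm{tr}\left(\tfrac{1}{n}\mathbf{X}^*\mathbf{X}\Lambda_X^{-1} - I\right) \tag{243}$$

$$R_{emp}^{ML*} \triangleq \frac{1}{n}\log\frac{P_{ML}(\mathbf{X}|\mathbf{Y})}{P_{ML}(\mathbf{X})} \overset{(84)}{=} H_{ML}(\mathbf{X}) - H_{ML}(\mathbf{X}|\mathbf{Y}) = \frac{d}{2}\cdot\log\frac{|C_{XX}|}{|C_{X|Y}|} \tag{244}$$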

where $C_{XX}, C_{X|Y}$ are as defined in Lemma 8. Note the similarity of the maximum likelihood empirical entropies to the entropies of Gaussian random vectors where the true covariance is replaced with the empirical covariance (or correlation) matrices (the entropy of $Z \sim \mathcal{N}(0, \Lambda_Z)$ is $\frac{1}{2}\log|2\pi e \Lambda_Z|$). Regarding the quasi-empirical entropy $H_{\tilde{Q}}(\mathbf{X})$, it is composed of two parts: the first is the true (statistical) entropy of the channel input $\mathbf{x}$, and the second part is a measure for the similarity between the empirical correlation matrix of the input and the average one. For typical $\mathbf{X}$, $\frac{1}{n}\mathbf{X}^*\mathbf{X} \approx \Lambda_X$ and the second part tends to 0. By definition (since $\tilde{Q}$ belongs to the parametric family $\Theta$), we have $P_{ML}(\mathbf{X}) \ge \tilde{Q}(\mathbf{X})$ and $H_{\tilde{Q}}(\mathbf{X}) \ge H_{ML}(\mathbf{X})$.

The parametric class we defined is separable in the sense discussed in Section VI-A4 (Equations (67), (68)), i.e. the joint Gaussian distribution of the vectors $\mathbf{x}, \mathbf{y}$ (defined by the joint mean and covariance) can be equivalently defined by the mean and covariance of $\mathbf{y}$, and parameters defining the conditional mean and covariance of $\mathbf{x}$ given $\mathbf{y}$ (or equivalently, the matrices $\Lambda_{x|y}$, $A$ and the vector $b$ as in (227)). Therefore (67), (68) hold with equality, i.e. we can write:

$$H_{ML}(\mathbf{X}|\mathbf{Y}) = H_{ML}(\mathbf{X}, \mathbf{Y}) - H_{ML}(\mathbf{Y}) = \frac{d}{2}\log\left|\tfrac{2}{d}\pi e\, C_{(XY)(XY)}\right| - \frac{d}{2}\log\left|\tfrac{2}{d}\pi e\, C_{YY}\right| \tag{245}$$



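where $C_{(XY)(XY)}$ is the empirical covariance/correlation matrix of the matrix $[\mathbf{X}, \mathbf{Y}]$. Alternatively, this relation can be obtained by using Leibniz formula

$$\begin{vmatrix} A & B \\ C & D \end{vmatrix} = \begin{vmatrix} A & 0 \\ C & I \end{vmatrix} \cdot \begin{vmatrix} I & A^{-1}B \\ 0 & D - CA^{-1}B \end{vmatrix} = |A| \cdot \left|D - CA^{-1}B\right| \tag{246}$$

to obtain the relation:

$$\left|C_{(XY)(XY)}\right| = \begin{vmatrix} C_{XX} & C_{XY} \\ C_{YX} & C_{YY} \end{vmatrix} = \begin{vmatrix} C_{YY} & C_{YX} \\ C_{XY} & C_{XX} \end{vmatrix} = |C_{YY}| \cdot \left|C_{XX} - C_{XY} C_{YY}^{-1} C_{YX}\right| = |C_{YY}| \cdot |C_{X|Y}| \tag{247}$$

Plugging into (245) and noting that the factors $\tfrac{2}{d}\pi e$ are canceled out due to the matching sizes of the matrices, proves the relation.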

Using this equality we can alternatively write $R_{emp}^{ML*}$ in a symmetrical form (85):

$$R_{emp}^{ML*} = H_{ML}(\mathbf{X}) + H_{ML}(\mathbf{Y}) - H_{ML}(\mathbf{X}, \mathbf{Y}) = \frac{d}{2}\cdot\log\frac{|C_{XX}| \cdot |C_{YY}|}{\left|C_{(XY)(XY)}\right|} \tag{248}$$

This form was presented in a previous paper [5] for the case $d = 1, u = 0$ and was proven to be asymptotically attainable (non adaptively). In that paper, the rate function was justified based on different considerations, of convergence to the mutual information for Gaussian channels.
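As a small numerical illustration, the sketch below evaluates $R_{emp}^{ML*}$ for real scalar sequences ($t = r = 1$, $d = 1$, $u = 0$) in two ways: through the conditional form $\frac{d}{2}\log(|C_{XX}|/|C_{X|Y}|)$ and through the symmetric form with the joint determinant. The function name, the sample sequences and the choice of base-2 logarithms are assumptions made only for this example.

```python
import math

def empirical_rate_ml_star(x, y):
    """Return R_emp^{ML*} in bits, via the conditional and the symmetric form.

    Assumes real scalar sequences (t = r = 1, d = 1) and u = 0, so the
    empirical correlation "matrices" are plain numbers.
    """
    n = len(x)
    cxx = sum(a * a for a in x) / n
    cyy = sum(b * b for b in y) / n
    cxy = sum(a * b for a, b in zip(x, y)) / n
    # Conditional form: C_{X|Y} = C_XX - C_XY C_YY^{-1} C_YX
    c_x_given_y = cxx - cxy * cxy / cyy
    conditional = 0.5 * math.log2(cxx / c_x_given_y)
    # Symmetric form: |C_XX| |C_YY| / |C_(XY)(XY)| with a 2x2 determinant
    det_joint = cxx * cyy - cxy * cxy
    symmetric = 0.5 * math.log2(cxx * cyy / det_joint)
    return conditional, symmetric

# Hypothetical input/output sequences, for illustration only.
x = [1.0, -2.0, 0.5, 3.0, -1.5]
y = [1.2, -1.7, 0.1, 2.6, -1.9]
conditional, symmetric = empirical_rate_ml_star(x, y)
```

Both values agree up to rounding, which is the scalar instance of the determinant identity used to pass between the two forms.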

3) Achievability of the rate function: In the Gaussian case, the parametric class is continuous, and $P_{ML}(\mathbf{X}|\mathbf{Y})$ may take unbounded values (when the matrices are highly correlated). Therefore the achievability proof is quite involved and uses the tools developed in Section VII-F3. We will use the metric defined in (164) with a parameter $\gamma \in (0, 1)$, which, using Theorem 7 and Lemma 7, can achieve adaptively the rate function $\gamma R_{emp}^{ML}$, and then take $\gamma \to 1$.

The main parts which are specific to the Gaussian case and need to be proven are: 

1) We need to bound $Q$: $0 < q_{\min} \le Q(\mathbf{x}_i | \mathbf{x}^{i-1}) \le q_{\max} < \infty$. This is done by trimming the input probability.

2) For the CCDF condition, we need to bound the quantity appearing in (166).

3) For the summability condition, calculate $S_0(\psi_0)$ from (172) related to the unconstrained symbols.

We first state the result. Part of the proof is given in the next sub-sections, while the more tedious parts are in the appendix.

Theorem 13. Consider the channel $\mathbf{x} \to \mathbf{y}$, where the input and output are vectors of size $t, r$ respectively, $\mathbf{x} \in \mathbb{B}^t, \mathbf{y} \in \mathbb{B}^r$, where each element is either real or complex valued, $\mathbb{B} = \begin{cases} \mathbb{R} & d = 1 \\ \mathbb{C} & d = 2 \end{cases}$. Let the $n \times t$ matrix $\mathbf{X}$ and the $n \times r$ matrix $\mathbf{Y}$ denote the channel input and output respectively.

Let the input distribution $Q$ be defined by an i.i.d. generation of each symbol $\mathbf{x}_i$ (row of $\mathbf{X}$) according to the following distribution:

$$Q(\mathbf{x}_i) = c \cdot \mathrm{Ind}\left(\mathbf{x}_i^* \Lambda_X^{-1} \mathbf{x}_i \le \Omega^2\right) \cdot e^{-\frac{d}{2}\mathbf{x}_i^* \Lambda_X^{-1} \mathbf{x}_i} \tag{249}$$

where $\Lambda_X$ is a chosen positive definite matrix, $\Omega$ is a chosen radius, and $c$ is a normalization factor chosen such that $\int_{\mathbb{B}^t} Q(\mathbf{x}) d\mathbf{x} = 1$. When $\Omega \to \infty$, $Q(\mathbf{x})$ tends to the Gaussian or complex Gaussian distribution with zero mean and covariance matrix $\Lambda_X$. Consider the following rate functions:

$$R_{emp}^{ML} = \frac{d}{2}\cdot\log\frac{|\Lambda_X|}{|C_{X|Y}|} + \frac{d}{2}\log e\cdot\mathrm{tr}\left(\tfrac{1}{n}\mathbf{X}^*\mathbf{X}\Lambda_X^{-1} - I\right) \tag{250}$$

$$R_{emp}^{ML*} = \frac{d}{2}\cdot\log\frac{|C_{XX}|}{|C_{X|Y}|} = \frac{d}{2}\cdot\log\frac{|C_{XX}| \cdot |C_{YY}|}{\left|C_{(XY)(XY)}\right|} \le R_{emp}^{ML} \tag{251}$$

where $C_{XX}, C_{X|Y}$ are either empirical correlation matrices (for $u = 0$) or covariance matrices (for $u = 1$), defined in Lemma 8. Then:

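1) $F(R_{emp}^{ML})$ and $F(R_{emp}^{ML*})$ are adaptively achievable, where:

$$F(t) = \frac{\eta\, t}{1 + \alpha t} - \delta \tag{252}$$

where $\eta, \alpha, \delta$ are defined as a function of the transmission length $n$, $\Omega$, the feedback delay $d_{FB}$, the number of bits per block $K$ (a chosen parameter), and $\gamma \in (0, 1)$ (a chosen parameter) as follows:

$$\eta = \gamma\left(1 + \frac{B_{n,\gamma}}{K}\right)^{-1}, \qquad \alpha = \frac{A_{n,\gamma}}{K + B_{n,\gamma}}, \qquad \delta = a_0 + \frac{K}{n}$$

$$A_{n,\gamma} = \gamma\left(\frac{a_3}{1-\gamma} + a_4\right)$$

$$B_{n,\gamma} = \log n + a_1 + a_2 \log\frac{1}{1-\gamma} + \left(\frac{a_3}{1-\gamma} + a_4\right)\cdot\gamma\cdot a_5$$

$$a_0 = \log\frac{1}{1-\delta_\Omega}$$

$$a_1 = a_0 + \log\frac{1}{d_{FB}\, e} + a_2\log(e)$$

$$a_2 = \frac{d}{4}(t + 1 + 2r + 2u)\cdot t$$

$$a_3 = t + 1 + r + u$$

$$a_4 = 2d_{FB} - 1$$

$$a_5 = \frac{d}{2}(t + \Omega^2)\cdot\log(e)$$

$$\delta_\Omega = \frac{\Gamma\left(\frac{dt}{2}, \frac{d\Omega^2}{2}\right)}{\Gamma\left(\frac{dt}{2}\right)}$$

2) $R_{emp}^{ML}$ and $R_{emp}^{ML*}$ are asymptotically adaptively achievable with a sequence of priors defined by $Q$ above (249) with $\Omega \to \infty$ as $n \to \infty$ (i.e. with the input distribution tending to Gaussian).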



The proof is organized as follows: in the subsections below we discuss the modified input probability and the summability condition. The computation of the CCDF condition, which is rather involved, appears in the appendix (Section E2). The final calculations that combine these results together also appear in the appendix (Section E3). Finally, we show in Section VIII-F6 a Lemma (which can be considered a corollary to Theorem 13), which gives a way to choose the parameters $\gamma, K$ that guarantees a bounded loss within a specified region.

Fig. 9. Illustration of the $R_{emp}$ lower bound of Theorem 13 and of Lemma 10. The achieved rate is plotted against $R_{emp}^{ML}$ for $n = 100{,}000$, $r = t = 2$. The full list of parameters appears in Table II in the appendix.

Figure 9 illustrates the lower bounds of Theorem 13 and of Lemma 10. The achieved rate is plotted against $R_{emp}^{ML}$ for $n = 100{,}000$, $r = t = 2$. The full list of parameters appears in Table II in the appendix. Due to the choice $R_0 = 5$, the bound of the lemma applies only for $R_{emp}^{ML} \le 5$. A comparison between Theorem 13 when specialized to the SISO real valued case $t = 1, r = 1, u = 0, d = 1$ and the looser results obtained for the same setting in our previous paper [1] appears in [9]. Note that with mild values of $\Omega$, very small values of $\delta_\Omega$ are obtained, and thus the resulting input distribution is very close to the desired Gaussian distribution.



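A result on non-adaptive achievability stems as a byproduct of the CCDF condition required for the proof of Theorem 13:

Lemma 9. Under the definitions of Theorem 13, for any $\gamma < 1 - \frac{t+1+r+u}{n}$ the rate function $\gamma R_{emp}^{ML}$ has an intrinsic redundancy:

$$\mu_Q\left(\gamma R_{emp}^{ML}\right) \le \frac{1}{n}\log\frac{1}{1-\delta_\Omega} + \frac{1}{n}\cdot\frac{d}{4}(t + 1 + 2r + 2u)\cdot t\cdot\log\frac{e}{1-\gamma} \tag{253}$$

and therefore by Theorem 2, $\gamma R_{emp}^{ML} - \frac{1}{n}\left(a_0 + \frac{d}{4}(t + 1 + 2r + 2u)\cdot t\cdot\log\frac{e}{1-\gamma}\right)$ is achievable.

The proof of the lemma appears at the end of Section E2.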



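4) The trimmed input probability: As noted, the distribution $\tilde{Q}$ is not bounded away from zero:

$$\tilde{Q}(\mathbf{x}_i) = \left|\tfrac{2}{d}\pi \Lambda_X\right|^{-\frac{d}{2}} e^{-\frac{d}{2}\mathbf{x}_i^* \Lambda_X^{-1} \mathbf{x}_i} \tag{254}$$

When $\mathbf{x} \to \infty$ (in almost every direction), $\tilde{Q}(\mathbf{x}) \to 0$, therefore it is not bounded from below as required by the conditions of Section VII-F3. To meet the condition we define the trimmed distribution, which limits $\mathbf{x}$ to an ellipsoid defined by a radius $\Omega$:

$$B_\Omega = \left\{\mathbf{x} : \mathbf{x}^* \Lambda_X^{-1} \mathbf{x} \le \Omega^2\right\} \tag{255}$$

$Q$ is the conditional density of $\mathbf{x}$ given that it belongs to $B_\Omega$:

$$Q(\mathbf{x}) = \frac{\mathrm{Ind}(\mathbf{x} \in B_\Omega)\cdot\tilde{Q}(\mathbf{x})}{\tilde{Q}\{B_\Omega\}} \tag{256}$$

In the case of a white input $\Lambda_X = I_{t \times t}$ this bounds the peak power of each input vector (which makes sense from a practical point of view). $\tilde{Q}\{B_\Omega\}$ can be easily evaluated, since according to $\tilde{Q}$, $d\cdot\mathbf{x}^*\Lambda_X^{-1}\mathbf{x}$ is distributed $\chi^2$ with $d\cdot t$ degrees of freedom (it is the power of the white vector $\sqrt{d}\cdot\Lambda_X^{-1/2}\mathbf{x}$, which has Gaussian i.i.d. entries, where the factor $\sqrt{d}$ for the complex case normalizes the variance of the real and imaginary parts to 1 rather than $\frac{1}{2}$):

$$\tilde{Q}\{B_\Omega\} = 1 - \Pr\left\{d\,\mathbf{x}^*\Lambda_X^{-1}\mathbf{x} > d\,\Omega^2\right\} = 1 - \frac{\Gamma\left(\frac{dt}{2}, \frac{d\Omega^2}{2}\right)}{\Gamma\left(\frac{dt}{2}\right)} \triangleq 1 - \delta_\Omega \tag{257}$$

where $\Gamma(t)$ is the gamma function, and $\Gamma(t, s)$ is the upper incomplete gamma function. $\delta_\Omega$ decays exponentially to 0 when $\Omega \to \infty$. Therefore we have:

$$Q(\mathbf{x}) = \mathrm{Ind}(\mathbf{x} \in B_\Omega)\cdot\frac{1}{1-\delta_\Omega}\cdot\tilde{Q}(\mathbf{x}) \tag{258}$$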
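To get a feeling for how small the trimmed mass is, the following sketch evaluates $\delta_\Omega = \Gamma(\frac{dt}{2}, \frac{d\Omega^2}{2}) / \Gamma(\frac{dt}{2})$ using only the Python standard library. It relies on the standard recurrence of the regularized upper incomplete gamma function for shapes that are multiples of $\frac{1}{2}$ (which covers every integer $d \cdot t$); the function names are hypothetical and chosen for this example.

```python
import math

def regularized_upper_gamma(a, x):
    """Gamma(a, x) / Gamma(a) for a a positive multiple of 1/2 and x >= 0."""
    twice_a = round(2 * a)
    if twice_a < 1 or abs(2 * a - twice_a) > 1e-12:
        raise ValueError("a must be a positive multiple of 1/2")
    if twice_a % 2 == 0:
        value, shape = math.exp(-x), 1.0               # Q(1, x) = e^{-x}
    else:
        value, shape = math.erfc(math.sqrt(x)), 0.5    # Q(1/2, x) = erfc(sqrt(x))
    # Recurrence: Q(s + 1, x) = Q(s, x) + x^s e^{-x} / Gamma(s + 1)
    while shape < a - 1e-12:
        if x > 0:
            value += math.exp(shape * math.log(x) - x - math.lgamma(shape + 1.0))
        shape += 1.0
    return value

def trimming_probability(d, t, omega):
    """delta_Omega: probability that a Gaussian symbol falls outside B_Omega."""
    return regularized_upper_gamma(d * t / 2.0, d * omega ** 2 / 2.0)

# Complex scalar input (d = 2, t = 1): delta_Omega = exp(-Omega^2).
delta_complex = trimming_probability(2, 1, 3.0)
# Real scalar input (d = 1, t = 1): delta_Omega = Pr(|N(0,1)| > Omega).
delta_real = trimming_probability(1, 1, 3.0)
```

Even a moderate radius such as $\Omega = 3$ already removes only a small fraction of the Gaussian mass in these scalar cases, in line with the exponential decay of $\delta_\Omega$.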

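Below we address some properties of $Q$ and differences that arise from substituting $Q$ instead of $\tilde{Q}$. For the trimmed distribution $Q$ we have that $Q(\mathbf{x}) \in \{0\} \cup [q_{\min}, q_{\max}]$ where:

$$q_{\min} = \min_{\mathbf{x} \in B_\Omega} Q(\mathbf{x}) = \frac{1}{1-\delta_\Omega}\left|\tfrac{2}{d}\pi \Lambda_X\right|^{-\frac{d}{2}} \min_{\mathbf{x} \in B_\Omega} e^{-\frac{d}{2}\mathbf{x}^*\Lambda_X^{-1}\mathbf{x}} = \frac{1}{1-\delta_\Omega}\left|\tfrac{2}{d}\pi \Lambda_X\right|^{-\frac{d}{2}} e^{-\frac{d}{2}\Omega^2} \tag{259}$$

$$q_{\max} = \max_{\mathbf{x} \in B_\Omega} Q(\mathbf{x}) = \frac{1}{1-\delta_\Omega}\left|\tfrac{2}{d}\pi \Lambda_X\right|^{-\frac{d}{2}} \max_{\mathbf{x} \in B_\Omega} e^{-\frac{d}{2}\mathbf{x}^*\Lambda_X^{-1}\mathbf{x}} = \frac{1}{1-\delta_\Omega}\left|\tfrac{2}{d}\pi \Lambda_X\right|^{-\frac{d}{2}} \tag{260}$$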

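We defined the quasi-empirical entropy $H_{\tilde{Q}}$ (240) and the rate function in (243) using $\tilde{Q}$, but the results of Lemma 7 and Theorem 7 apply to rate functions defined using the true input distribution $Q$. However since for $\mathbf{x} \in B_\Omega$ we have $Q(\mathbf{x}) \ge \tilde{Q}(\mathbf{x})$, we have:

$$H_Q(\mathbf{X}) = -\frac{1}{n}\log Q(\mathbf{X}) = -\frac{1}{n}\log\frac{\tilde{Q}(\mathbf{X})}{(1-\delta_\Omega)^n} = \log(1-\delta_\Omega) + H_{\tilde{Q}}(\mathbf{X}) \tag{261}$$

And therefore there is a loss of $\log\frac{1}{1-\delta_\Omega}$ in the rate.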

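In the sequel, we compute the expected value in the Markov CCDF condition of Theorem 7. It is convenient for the sake of this calculation to assume $\mathbf{X} \sim \tilde{Q}$ (i.e. is Gaussian) rather than $\mathbf{X} \sim Q$. There is a simple relation between the expected values in this case. For every non-negative function $g(\mathbf{x})$:

$$\underset{Q}{E}\, g(\mathbf{x}) = \int_{\mathbf{x} \in B_\Omega} Q(\mathbf{x}) g(\mathbf{x}) d\mathbf{x} = \frac{1}{1-\delta_\Omega}\int_{\mathbf{x} \in B_\Omega} g(\mathbf{x})\cdot\tilde{Q}(\mathbf{x}) d\mathbf{x} \le \frac{1}{1-\delta_\Omega}\int_{\mathbf{x} \in \mathbb{B}^t} \tilde{Q}(\mathbf{x}) g(\mathbf{x}) d\mathbf{x} = \frac{1}{1-\delta_\Omega}\underset{\tilde{Q}}{E}\, g(\mathbf{x}) \tag{262}$$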



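5) The summability condition: We use Lemma 7 in order to prove the summability condition. In our case $\theta = [A, b, \Lambda_{x|y}]$ (see Section VIII-F1). As we saw, the ML estimate of $\Lambda_{x|y}$ is $C_{X|Y}$ and $P_{ML}(\mathbf{X}|\mathbf{Y}) = \left|\tfrac{2}{d}\pi e\, C_{X|Y}\right|^{-\frac{dn}{2}}$. On the other hand, the per-letter probability satisfies (227):

$$P_{\max}(\theta) = \max_{\mathbf{x}, \mathbf{y}} P_\theta(\mathbf{x}|\mathbf{y}) = \max_{\mathbf{x}, \mathbf{y}} \left|\tfrac{2}{d}\pi \Lambda_{x|y}\right|^{-\frac{d}{2}} e^{-\frac{d}{2}(\mathbf{x} - \mathbf{y}A - b)\Lambda_{x|y}^{-1}(\mathbf{x} - \mathbf{y}A - b)^*} = \left|\tfrac{2}{d}\pi \Lambda_{x|y}\right|^{-\frac{d}{2}} \tag{263}$$

where $\mathbf{x}, \mathbf{y}$ are single rows of $\mathbf{X}, \mathbf{Y}$ (single symbols). We can observe that knowing $P_{ML}(\mathbf{X}|\mathbf{Y})$ determines $|\Lambda_{x|y}|$, and this relates to $P_{\max}(\theta)$.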

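Referring to Lemma 7 we have:

$$\Theta^{(n)}(t) = \left\{\hat{\theta}_{ML}(\mathbf{x}|\mathbf{y}) : P_{ML}(\mathbf{x}|\mathbf{y}) \le t\right\} = \left\{\theta : \left|\tfrac{2}{d}\pi e \Lambda_{x|y}\right|^{-\frac{dn}{2}} \le t\right\} = \left\{\theta : \left|\tfrac{2}{d}\pi e \Lambda_{x|y}\right|^{-\frac{d}{2}} \le t^{\frac{1}{n}}\right\} \tag{264}$$

$$S_0(\psi_0) = \max_{\theta \in \Theta^{(n)}(q_{\max}^n \psi_0)} P_{\max}(\theta) = \max_{\theta \in \Theta^{(n)}(q_{\max}^n \psi_0)} \left|\tfrac{2}{d}\pi \Lambda_{x|y}\right|^{-\frac{d}{2}} = e^{\frac{dt}{2}} \max_{\theta \in \Theta^{(n)}(q_{\max}^n \psi_0)} \left|\tfrac{2}{d}\pi e \Lambda_{x|y}\right|^{-\frac{d}{2}} = e^{\frac{dt}{2}}\, q_{\max}\cdot(\psi_0)^{\frac{1}{n}} \tag{265}$$

Therefore by the lemma, the summability condition in Theorem 7 holds with

$$f_0(\psi_0^n) = \gamma\cdot\log\left(S_0(\psi_0^n)\cdot\frac{1}{q_{\min}}\right) = \frac{dt}{2}\gamma\cdot\log(e) + \gamma\cdot\log\frac{q_{\max}}{q_{\min}} + \gamma\cdot\log\left((\psi_0^n)^{\frac{1}{n}}\right)$$
$$= \frac{dt}{2}\gamma\cdot\log(e) + \frac{d\Omega^2}{2}\gamma\cdot\log(e) + \frac{\gamma}{n}\cdot\log(\psi_0^n) = \frac{d}{2}(t + \Omega^2)\gamma\cdot\log(e) + \frac{\gamma}{n}\cdot\log(\psi_0^n) \tag{266}$$

The proof of Theorem 13 is finalized in the appendix (Section E3).

6) Selection of parameters for finite n by approximate optimization: The achievable rate defined in Theorem 13 has a rather complex expression and it is not clear how to select the parameters. Below, we present a coarse way to choose these parameters by trying to minimize the main loss factors. We assume $\Omega$ is fixed, and so are the overheads related to it, and focus on $K, \gamma$. For various values of $K, \gamma$ we obtain different curves, none of which is uniformly better than others. The loss with respect to $R_{emp}^{ML}$ determined by (322) increases with $R_{emp}^{ML}$, therefore it makes sense to optimize for all rates up to a certain value $R_{emp}^{ML} = R_0$. In the appendix (Section E4), we develop a coarse bound for the rate loss in the region $0 \le R_{emp}^{ML} \le R_0$, and minimize the bound. This results in the following Lemma: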

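Lemma 10. Under the definitions of Theorem 13, let $R_0 > 0$, and select $\gamma = 1 - \sqrt{\frac{a_6}{K}}$, $K = \left\lceil (n\cdot\sqrt{a_6}\cdot R_0)^{\frac{2}{3}} \right\rceil$, where $a_6 = \log n + a_1 + a_2 + (a_3 + a_4)(R_0 + a_5)$, then

$$\forall t \in [0, R_0] : F(t) \ge t - \delta_0 - a_0 \tag{267}$$

where $\delta_0 = 3 n^{-\frac{1}{3}} a_6^{\frac{1}{3}} R_0^{\frac{2}{3}} + \frac{1}{n}$.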
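The selection rule can be tried out numerically. The sketch below reads the lemma as $K = \lceil (n \sqrt{a_6} R_0)^{2/3} \rceil$, $\gamma = 1 - \sqrt{a_6 / K}$ and $\delta_0 = 3 n^{-1/3} a_6^{1/3} R_0^{2/3} + \frac{1}{n}$, and compares $\delta_0$ with the coarse loss expression $(1-\gamma) R_0 + \frac{a_6}{(1-\gamma) K} R_0 + \frac{K}{n}$ that the selection is meant to minimize. The function names and the value used for $a_6$ are hypothetical; in practice $a_6$ would be computed from $a_1, \ldots, a_5$.

```python
import math

def lemma10_parameters(n, r0, a6):
    """Return (K, gamma, delta0) following the selection rule of Lemma 10."""
    k = math.ceil((n * math.sqrt(a6) * r0) ** (2.0 / 3.0))
    gamma = 1.0 - math.sqrt(a6 / k)
    delta0 = 3.0 * (a6 * r0 * r0 / n) ** (1.0 / 3.0) + 1.0 / n
    return k, gamma, delta0

def coarse_loss(n, r0, a6, k, gamma):
    """Coarse loss bound (1-gamma) R0 + a6 R0 / ((1-gamma) K) + K/n."""
    return (1.0 - gamma) * r0 + a6 * r0 / ((1.0 - gamma) * k) + k / n

# Hypothetical values: n and R0 as in the illustration of Fig. 9, a6 made up.
n, r0, a6 = 100000, 5.0, 60.0
k, gamma, delta0 = lemma10_parameters(n, r0, a6)
loss = coarse_loss(n, r0, a6, k, gamma)
```

With these inputs the loss expression evaluated at the selected $K, \gamma$ does not exceed $\delta_0$; the extra $\frac{1}{n}$ in $\delta_0$ absorbs the rounding of $K$ to an integer.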

IX. Comments & further research 
A. Comparison with previous results and techniques 

The asymptotic adaptive and non-adaptive achievability of the empirical mutual information and the second order rate function of Theorem 13 (when particularized to the real valued SISO case $t = r = 1, d = 1, u = 0$) was shown in the previous paper [1]. The current results improve on the previous results in many respects (although they are inferior in others). Due to space limits, the reader is referred to [9] for a detailed comparison.

Appendix
A. Proof of the properties of intrinsic redundancy

In this section we prove the two properties of intrinsic redundancy presented in Section IV-A.

Proof of property 1: The intrinsic redundancy increases linearly when an offset $\delta \in \mathbb{R}$ is added to (or subtracted from) the rate function:

$$\mu_Q(R_{emp} + \delta) = \sup_{\mathbf{y}, R}\left\{\frac{1}{n}\log Q\{R_{emp}(\mathbf{X}, \mathbf{y}) + \delta \ge R\} + R\right\}$$
$$\overset{R' = R - \delta}{=} \sup_{\mathbf{y}, R'}\left\{\frac{1}{n}\log Q\{R_{emp}(\mathbf{X}, \mathbf{y}) \ge R'\} + R' + \delta\right\} \tag{268}$$
$$= \mu_Q(R_{emp}) + \delta$$

Proof of property 2: by the union bound:

$$Q\left\{\max_{k \in \{1,\ldots,K\}} R_{emp_k} \ge R\right\} = Q\left\{\bigcup_{k \in \{1,\ldots,K\}} (R_{emp_k} \ge R)\right\} \le \sum_{k=1}^{K} Q\{R_{emp_k} \ge R\} \le K \max_{k \in \{1,\ldots,K\}} Q\{R_{emp_k} \ge R\} \tag{269}$$

$$\mu_Q\left(\max_{k \in \{1,\ldots,K\}} R_{emp_k}\right) = \sup_{\mathbf{y}, R}\left\{\frac{1}{n}\log Q\left\{\max_{k \in \{1,\ldots,K\}} R_{emp_k} \ge R\right\} + R\right\}$$
$$\le \sup_{\mathbf{y}, R}\left\{\frac{1}{n}\log\left[K \max_{k \in \{1,\ldots,K\}} Q\{R_{emp_k} \ge R\}\right] + R\right\}$$
$$= \sup_{\mathbf{y}, R, k}\left\{\frac{1}{n}\log\left[K\, Q\{R_{emp_k} \ge R\}\right] + R\right\} \tag{270}$$
$$= \max_{k \in \{1,\ldots,K\}} \sup_{\mathbf{y}, R}\left\{\frac{1}{n}\log\left[Q\{R_{emp_k} \ge R\}\right] + R\right\} + \frac{\log(K)}{n}$$
$$= \max_{k} \mu_Q(R_{emp_k}) + \frac{\log(K)}{n} \qquad \Box$$

B. Achievability of good-put functions for rate adaptive systems [UNFINISHED]

In Section IV-E it was shown that good-put functions (defined therein) for fixed-rate systems are asymptotically achievable rate functions. Here, the result is extended to good-put functions of rate adaptive systems. Notice that it is not shown that these functions are adaptively achievable.



The same derivation of Section IV-E is followed, while conditioning on $R_{sys}$. Consider the conditional form:

$$R_{good}(\mathbf{x}, \mathbf{y} | R_{sys}) = E\left[(1 - \epsilon_{sys}) R_{sys} \,\middle|\, \mathbf{x}, \mathbf{y}, R_{sys}\right] \tag{271}$$

Since R^y^ = R^y^{S,y), and y is considered constant, this conditioning only affects the distribution of S (and not of X and 
m). Thus, it still holds that ( pT) : 

-Rgood(x,y|i?,yJ^ 



exp(— ni?g. 



E 



R.. 



Or, in other words: 



E 



-Rgood(x,y|i?,yJ 



R. 



'Pr(X = x|i?,,J. 



R,y, exp(-ni?,y,) 



(272) 



(273) 



C. A binary on/off channel 



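In Section VI-B we mentioned the binary on/off channel as an example for a non-ergodic channel, where the rate that can be achieved on average (adaptively) is larger than the rate that is achieved in worst case (the Han-Verdú capacity). Here we complete the example by analyzing the information density of this channel.

The channel may be in one of two states, which are determined by a single random drawing with equal probabilities - either the output equals the input for $j = 1, \ldots, n$, or it is independent of the input. The information density of this channel, for uniform i.i.d. input, is a random variable taking values close to 0, 1 [bits] with equal probabilities, as shown below.

$$\Pr(\mathbf{X}) = \frac{1}{2^n} \tag{274}$$

$$\Pr(\mathbf{Y}|\mathbf{X}) = \frac{1}{2}\cdot\mathrm{Ind}(\mathbf{Y} = \mathbf{X}) + \frac{1}{2}\cdot\frac{1}{2^n} \tag{275}$$

$$\Pr(\mathbf{Y}) = \frac{1}{2^n} \tag{276}$$

$$i = \frac{1}{n}\log\frac{\Pr(\mathbf{Y}|\mathbf{X})}{\Pr(\mathbf{Y})} = \frac{1}{n}\log\left(\frac{1}{2}\cdot 2^n\cdot\mathrm{Ind}(\mathbf{Y} = \mathbf{X}) + \frac{1}{2}\right) = -\frac{1}{n} + \frac{1}{n}\log\left(2^n\cdot\mathrm{Ind}(\mathbf{Y} = \mathbf{X}) + 1\right) \tag{277}$$

$$= -\frac{1}{n} + \begin{cases} \frac{1}{n}\log(2^n + 1) & \Pr = \frac{1}{2}\left(1 + \frac{1}{2^n}\right) \\ 0 & \text{o.w.} \end{cases} \approx \begin{cases} 1 & \Pr = \frac{1}{2} \\ 0 & \Pr = \frac{1}{2} \end{cases} \tag{278}$$

Therefore the liminf in probability of $i$ is 0, and therefore we see also by the Han-Verdú formula that the Shannon capacity of this channel is 0 (which is clear from an operational perspective).

Note: the reason that $E[i] < \frac{1}{2}$ is that some information is lost due to not knowing the channel state: $I(\mathbf{X}; \mathbf{Y}) \le I(\mathbf{X}; \mathbf{Y} | \text{State}) = \frac{1}{2}$.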
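The two values of the information density and its mean can be tabulated directly from the probabilities above. The following is a minimal sketch for a moderate block length (for very large $n$ the term $2^{-n}$ underflows in floating point); the function name is hypothetical.

```python
import math

def on_off_information_density(n):
    """Values, probabilities and mean of i = (1/n) log2(Pr(Y|X) / Pr(Y))."""
    p_y = 2.0 ** (-n)                                   # Pr(Y), uniform output
    p_equal = 0.5 * (1.0 + 2.0 ** (-n))                 # Pr(Y = X)
    i_equal = math.log2((0.5 + 0.5 * p_y) / p_y) / n    # density when Y = X
    i_differ = math.log2((0.5 * p_y) / p_y) / n         # density when Y != X
    mean = p_equal * i_equal + (1.0 - p_equal) * i_differ
    return i_equal, p_equal, i_differ, mean

n = 20
i_equal, p_equal, i_differ, mean = on_off_information_density(n)
```

For $n = 20$ the density is close to 1 bit when $\mathbf{Y} = \mathbf{X}$, equals $-\frac{1}{n}$ otherwise, and its mean stays below $\frac{1}{2}$, as the note above states.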



D. Proof of Lemma p] 

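Assume $R^*_{emp}(\mathbf{x}, \mathbf{y}) = \frac{1}{n}\log\frac{f(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})} - \delta$ is achievable. For every $\gamma \in (0, 1)$, by Lemma 1 one has

$$E\left[\exp\left(n\gamma R^*_{emp}(\mathbf{X}, \mathbf{y})\right)\right] \le \frac{1}{(1-\epsilon)(1-\gamma)} \tag{279}$$

On the other hand

$$E\left[\exp\left(n\gamma R^*_{emp}(\mathbf{X}, \mathbf{y})\right)\right] = \exp(-n\gamma\delta)\, E\left[\frac{f(\mathbf{X}|\mathbf{y})}{Q(\mathbf{X})}\cdot\left(\frac{f(\mathbf{X}|\mathbf{y})}{Q(\mathbf{X})}\right)^{-(1-\gamma)}\right]$$
$$\ge \exp(-n\gamma\delta)\, \underbrace{E\left[\frac{f(\mathbf{X}|\mathbf{y})}{Q(\mathbf{X})}\right]}_{=1}\cdot\exp(-n(1-\gamma)R_{\max}) = \exp(-n\gamma\delta - n(1-\gamma)R_{\max}) \tag{280}$$

where the inequality uses $\frac{1}{n}\log\frac{f(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})} \le R_{\max}$. Combining with (279) we have:

$$\exp(-n\gamma\delta - n(1-\gamma)R_{\max}) \le \frac{1}{(1-\epsilon)(1-\gamma)} \tag{281}$$

Which yields after rearrangement:

$$\delta \ge \frac{\log(1-\epsilon) + \log(1-\gamma) - n(1-\gamma)R_{\max}}{n\gamma} \tag{282}$$

To approximately maximize the RHS with respect to $\gamma$ (in fact, to maximize $\log(1-\gamma) - n(1-\gamma)R_{\max}$) we set $\gamma = 1 - \frac{1}{nR_{\max}}$ and obtain:

$$\delta \ge \frac{\log(1-\epsilon) - \log(nR_{\max}) - 1}{n - R_{\max}^{-1}} = -\frac{\log(n) + \log\frac{R_{\max}}{1-\epsilon} + 1}{n - R_{\max}^{-1}} \tag{283}$$

which proves the Lemma. $\Box$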
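The choice $\gamma = 1 - \frac{1}{nR_{\max}}$ is the exact maximizer of the auxiliary expression $\log(1-\gamma) - n(1-\gamma)R_{\max}$ (set its derivative $-\frac{1}{1-\gamma} + nR_{\max}$ to zero). The sketch below checks this on a grid, using natural logarithms; the values of $n$ and $R_{\max}$ are hypothetical.

```python
import math

def objective(gamma, n, r_max):
    """log(1 - gamma) - n (1 - gamma) R_max, with natural logarithm."""
    return math.log(1.0 - gamma) - n * (1.0 - gamma) * r_max

n, r_max = 1000, 2.0                       # illustrative values only
gamma_star = 1.0 - 1.0 / (n * r_max)
best = objective(gamma_star, n, r_max)

# Compare against a fine grid of gamma values in (0, 1).
grid = [i / 100000.0 for i in range(1, 100000)]
grid_max = max(objective(g, n, r_max) for g in grid)
```

At the maximizer the expression equals $-\log(nR_{\max}) - 1$, which is the numerator term appearing in the final bound.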



E. Completion of the proofs for the Gaussian MIMO case 

Below we give the detailed derivations that complete the proofs of Theorem 13 and some related results that appear in Section VIII-F.

1) Optimal linear estimator without an additive factor: In Section VIII-F1 we presented a conditional probability density for the Gaussian family (228), which includes a linear estimator of the form $A\mathbf{y} + b$. The maximization of (228) over $A, b$ was solved using an LMMSE estimator (229). For the case where $b = 0$ ($u = 0$), i.e. the estimator is required to be of the form $A\mathbf{y}$, we claimed the same solution holds, where $\mu_x, \mu_y$ are replaced with zeros. Here we provide a proof of this claim (which follows the same proof as the optimality of the MMSE estimator).
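Before the formal statement, the claim can be checked on a toy case. The sketch below treats single real columns ($t = r = 1$), for which the least-squares coefficient without an additive term reduces to $\hat{A} = (\mathbf{Y}^*\mathbf{Y})^{-1}\mathbf{Y}^*\mathbf{X} = \sum_i y_i x_i / \sum_i y_i^2$, and verifies the orthogonality of the residual and that perturbing $\hat{A}$ only increases the residual energy. The function names and data are hypothetical.

```python
def least_squares_coefficient(y, x):
    """A_hat = (Y*Y)^{-1} Y*X for single real columns y and x."""
    return sum(yi * xi for yi, xi in zip(y, x)) / sum(yi * yi for yi in y)

def residual_energy(y, x, a):
    """(X - Y a)*(X - Y a) for single real columns."""
    return sum((xi - yi * a) ** 2 for yi, xi in zip(y, x))

# Hypothetical output column y and input column x.
y = [1.0, 2.0, -1.0, 0.5]
x = [0.9, 2.3, -1.2, 0.1]
a_hat = least_squares_coefficient(y, x)
# Orthogonality criterion: Y*(X - Y a_hat) = 0
orthogonality = sum(yi * (xi - yi * a_hat) for yi, xi in zip(y, x))
```

The residual is orthogonal to $\mathbf{Y}$ up to rounding, which is exactly the property the proof below builds on.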



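Lemma 11. The matrix $A$ minimizing $(\mathbf{X} - \mathbf{Y}A)^*(\mathbf{X} - \mathbf{Y}A)$ (in matrix sense) is

$$\hat{A} = (\mathbf{Y}^*\mathbf{Y})^{-1}\mathbf{Y}^*\mathbf{X} \tag{284}$$

Proof: The matrix $\hat{A}$ defined above satisfies the orthogonality criterion:

$$\mathbf{Y}^*(\mathbf{X} - \mathbf{Y}\hat{A}) = 0 \tag{285}$$

Consider a different matrix $A$ and write:

$$(\mathbf{X} - \mathbf{Y}A)^*(\mathbf{X} - \mathbf{Y}A) = \left[(\mathbf{X} - \mathbf{Y}\hat{A}) + \mathbf{Y}(\hat{A} - A)\right]^*\left[(\mathbf{X} - \mathbf{Y}\hat{A}) + \mathbf{Y}(\hat{A} - A)\right]$$
$$\overset{(285)}{=} (\mathbf{X} - \mathbf{Y}\hat{A})^*(\mathbf{X} - \mathbf{Y}\hat{A}) + (\hat{A} - A)^*\mathbf{Y}^*\mathbf{Y}(\hat{A} - A) \ge (\mathbf{X} - \mathbf{Y}\hat{A})^*(\mathbf{X} - \mathbf{Y}\hat{A}) \tag{286}$$

$\Box$

2) The CCDF condition: Based on Section VII-F3 let:

$$\psi(\mathbf{X}^k, \mathbf{Y}^k, j) = \left(\frac{P_{ML}(\mathbf{X}_{j+1}^k | \mathbf{Y}_{j+1}^k)}{Q(\mathbf{X}_{j+1}^k)}\right)^{\gamma} = \psi(\mathbf{X}_{j+1}^k, \mathbf{Y}_{j+1}^k, 0) \tag{287}$$

Note that $\psi$ is of the form (164), where some dependencies were removed due to the i.i.d. nature of the distribution $P_\theta$. Note that $\psi(\mathbf{X}^k, \mathbf{Y}^k, j)$ (recall: the metric at time $k$ for the block which started at time $j + 1$) is dependent only on $\mathbf{X}_{j+1}^k, \mathbf{Y}_{j+1}^k$, i.e. the values of the channel input and output inside the block. The Markov sufficient condition of Theorem 7 is:

$$E\left[\psi(\mathbf{X}^k, \mathbf{Y}^k, j) \,\middle|\, \mathbf{X}^j\right] = E\left[\psi(\mathbf{X}_{j+1}^k, \mathbf{Y}_{j+1}^k, 0)\right] \le L_{k-j} \tag{288}$$

For brevity we define $m = k - j$, and the matrices $\mathbf{X} = \mathbf{X}_{j+1}^k$, $\mathbf{Y} = \mathbf{Y}_{j+1}^k$ of sizes $m \times t, m \times r$ respectively. We have $E\left[\psi(\mathbf{X}_{j+1}^k, \mathbf{Y}_{j+1}^k, 0)\right] = E\left[\psi(\mathbf{X}, \mathbf{Y}, 0)\right]$. Using (262), we bound, instead, the following value:

$$\tilde{L}_m = \underset{\tilde{Q}}{E}\left[\psi(\mathbf{X}, \mathbf{Y}, 0)\right] \tag{289}$$

and therefore for the rest of this section we assume $\mathbf{X}$ has a Gaussian distribution.

We define $\mathbf{V} = \mathbf{X}\Lambda_X^{-1/2}$ as the whitened version of $\mathbf{X}$: the elements of $\mathbf{V}_{m \times t}$ are independent unit variance Gaussian (/complex Gaussian) random variables. To calculate $\tilde{L}_m$ it is convenient to present $P_{ML}(\mathbf{X}|\mathbf{Y})$ by way of sequential projection of the columns of $\mathbf{V}$ on the subspaces created by $\mathbf{Y}$ and the previous columns. The concept is the same as was used in the conference paper [5], but the details slightly differ, mainly due to the different rate function ($R_{emp}^{ML}$ rather than $R_{emp}^{ML*}$).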



We define the combined matrix $\mathbf{Z}_{m \times (u+t+r)} \triangleq [\mathbf{1}_u, \mathbf{Y}, \mathbf{V}]$, where $\mathbf{1}_u \triangleq \mathbf{1}_{m \times 1}$ for $u = 1$, and for $u = 0$ the vector $\mathbf{1}_u$ is an empty vector and is excluded from $\mathbf{Z}$. By QR decomposition we can write $\mathbf{Z} = \mathbf{Q}_z \cdot \mathbf{R}_z$ with $\mathbf{Q}_z^*\mathbf{Q}_z = I$ and $\mathbf{R}_z$ upper triangular. As a reminder, QR decomposition is performed by the Gram-Schmidt process. We start from the left column of $\mathbf{Z}$ and work our way to the last one. At each time we take a column of $\mathbf{Z}$ and split it to the part which can be represented by a linear combination of the columns to the left of it (equivalently, of the columns of $\mathbf{Q}_z$ that were already generated), and the "innovation", i.e. the part which is orthogonal to the subspace generated by the previous columns. The vector representing the innovation is normalized, and becomes the respective column of $\mathbf{Q}_z$, and its norm becomes the diagonal element in $\mathbf{R}_z$. The coefficients representing the part of the vector which is in the subspace of previous columns become the elements of $\mathbf{R}_z$ above the diagonal. Another important property of QR decomposition is that the determinant of $\mathbf{Z}^*\mathbf{Z}$ can be written in terms of the diagonal elements in $\mathbf{R}_z$: $|\mathbf{Z}^*\mathbf{Z}| = |\mathbf{R}_z^*\mathbf{Q}_z^*\mathbf{Q}_z\mathbf{R}_z| = |\mathbf{R}_z^*\mathbf{R}_z| = |\mathbf{R}_z|^2 = \prod_i |R_{z,ii}|^2$. For this equality to be correct in the complex case we define the operation $|\cdot|$ to imply absolute-determinant.
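The determinant property can be illustrated with a few lines of classical Gram-Schmidt. The sketch below handles real matrices only and keeps just the diagonal of $\mathbf{R}_z$ (the innovation norms); the example columns (an all-ones column followed by two arbitrary columns) and the helper names are hypothetical.

```python
import math

def gram_schmidt_diagonal(columns):
    """Return the diagonal of R in Z = QR, computed by classical Gram-Schmidt."""
    basis, diagonal = [], []
    for col in columns:
        innovation = list(col)
        for q in basis:
            # Remove the projection of the column on an earlier basis vector.
            coeff = sum(qi * ci for qi, ci in zip(q, col))
            innovation = [vi - coeff * qi for vi, qi in zip(innovation, q)]
        norm = math.sqrt(sum(vi * vi for vi in innovation))
        diagonal.append(norm)
        basis.append([vi / norm for vi in innovation])
    return diagonal

def determinant(matrix):
    """Determinant by cofactor expansion (fine for the tiny matrices used here)."""
    if len(matrix) == 1:
        return matrix[0][0]
    total = 0.0
    for j, pivot in enumerate(matrix[0]):
        minor = [row[:j] + row[j + 1:] for row in matrix[1:]]
        total += (-1) ** j * pivot * determinant(minor)
    return total

# Columns of Z (m = 4): the all-ones vector and two arbitrary columns.
columns = [[1.0, 1.0, 1.0, 1.0], [0.5, -1.0, 2.0, 0.0], [1.5, 0.3, -0.7, 2.2]]
gram = [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]
diagonal = gram_schmidt_diagonal(columns)
product_of_squares = 1.0
for r_ii in diagonal:
    product_of_squares *= r_ii * r_ii
```

The first diagonal element is $\sqrt{m}$, the norm of the all-ones column, and the product of the squared diagonal elements matches the determinant of $\mathbf{Z}^*\mathbf{Z}$.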

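We may split the matrices $\mathbf{Q}_z, \mathbf{R}_z$ into several parts, matching the separate matrices $\mathbf{1}_u, \mathbf{Y}, \mathbf{V}$ as follows:

$$\mathbf{Z} = [\mathbf{1}_u, \mathbf{Y}, \mathbf{V}] = \left[\, \mathbf{Q}_1 \;\middle|\; \mathbf{Q}_{y|1} \;\middle|\; \mathbf{Q}_{v|y1} \,\right] \cdot \begin{bmatrix} \sqrt{m} & \mathbf{r}_{y|1} & \mathbf{r}_{v|1} \\ 0 & \mathbf{R}_y & \mathbf{R}_{v|y1} \\ 0 & 0 & \mathbf{R}_v \end{bmatrix} \tag{290}$$

where the blocks dividing the matrices $\mathbf{Q}_z, \mathbf{R}_z$ have sizes $u, r, t$ (respectively), $\mathbf{R}_y$ and $\mathbf{R}_v$ are upper triangular and $\mathbf{Q}_1 = \begin{cases} \frac{1}{\sqrt{m}}\cdot\mathbf{1} & u = 1 \\ [\emptyset] & u = 0 \end{cases}$ is just the normalization of the vector $\mathbf{1}_u$ (when $u = 0$ the first row and column of the RHS of (290) are absent). The matrices $\mathbf{Q}_1, \mathbf{Q}_{y|1}, \mathbf{Q}_{v|y1}$ contain orthogonal columns. The meaning of (290) is that each column of $\mathbf{V}$ is represented by its projection on $\mathbf{1}_u$ (which is the mean of the rows, up to a constant), its projection on the subspace defined by the columns of $\mathbf{Y}$ and on the previous columns of $\mathbf{V}$, and finally by a new element which is orthogonal to the previous subspaces. We can write:

$$\mathbf{Y} = \mathbf{1}_u \tfrac{1}{\sqrt{m}}\mathbf{r}_{y|1} + \mathbf{Q}_{y|1}\mathbf{R}_y \tag{291}$$

$$\mathbf{V} = \mathbf{1}_u \tfrac{1}{\sqrt{m}}\mathbf{r}_{v|1} + \mathbf{Q}_{y|1}\mathbf{R}_{v|y1} + \mathbf{Q}_{v|y1}\mathbf{R}_v \tag{292}$$

$$\mathbf{X} = \mathbf{V}\Lambda_X^{1/2} = \mathbf{1}_u \tfrac{1}{\sqrt{m}}\mathbf{r}_{v|1}\Lambda_X^{1/2} + \mathbf{Q}_{y|1}\mathbf{R}_{v|y1}\Lambda_X^{1/2} + \mathbf{Q}_{v|y1}\mathbf{R}_v\Lambda_X^{1/2} \tag{293}$$

We would like to show that $P_{ML}(\mathbf{X}|\mathbf{Y})$ can be written as a function of the diagonal elements in $\mathbf{R}_v$ alone. This can be proven in a technical form simply by substitution of (291), (293) into the expressions in Lemma 8, but an alternative proof that shows the fundamental reason for that is by recalling that $P_{ML}(\mathbf{X}|\mathbf{Y})$ maximizes $P_\theta$ given by (228). In maximizing $P_\theta$ we first find the best linear approximation of $\mathbf{X}$ by $\mathbf{Y}$ and $\mathbf{1}_u$, and then the covariance matrix of the remainder (error). Clearly the best approximation of $\mathbf{X}$ by $\mathbf{Y}$ and $\mathbf{1}_u$ is in the subspace spanned by $\mathbf{1}_u, \mathbf{Q}_{y|1}$, which is described by the first two elements in (293), and therefore the error is the remainder $\mathbf{Q}_{v|y1}\mathbf{R}_v\Lambda_X^{1/2}$. We obtain: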

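$$P_{ML}(\mathbf{X}|\mathbf{Y}) = \left|\tfrac{2}{d}\pi e\cdot\tfrac{1}{m}\Lambda_X^{1/2}\mathbf{R}_v^*\mathbf{R}_v\Lambda_X^{1/2}\right|^{-\frac{dm}{2}} = \left|\tfrac{2}{d}\pi e\,\tfrac{\Lambda_X}{m}\right|^{-\frac{dm}{2}} |\mathbf{R}_v|^{-dm} \tag{294}$$

Substituting into (289) we have:

$$\tilde{L}_m = E\left[\left(\frac{\left|\tfrac{2}{d}\pi e\,\tfrac{\Lambda_X}{m}\right|^{-\frac{dm}{2}} |\mathbf{R}_v|^{-dm}}{\left|\tfrac{2}{d}\pi \Lambda_X\right|^{-\frac{dm}{2}} e^{-\frac{d}{2}\mathrm{tr}(\mathbf{X}^*\mathbf{X}\Lambda_X^{-1})}}\right)^{\gamma}\right] = \left(\frac{e}{m}\right)^{-\frac{d}{2}\gamma t m}\cdot E\left[|\mathbf{R}_v|^{-\gamma d m}\, e^{\frac{d}{2}\gamma\,\mathrm{tr}(\mathbf{V}^*\mathbf{V})}\right]$$
$$= E\left[\prod_{i=1}^{t} \underbrace{\left(\frac{e}{m}\right)^{-\frac{d}{2}\gamma m} |R_{v,ii}|^{-\gamma d m}\, e^{\frac{d}{2}\gamma \|\mathbf{v}_i\|^2}}_{\triangleq A_i}\right] \tag{295}$$

where $\mathbf{v}_i$ is the $i$-th column of $\mathbf{V}$. Since the $\mathbf{v}_i$ are independent and are isotropically distributed (since their elements are Gaussian i.i.d.), the innovation norms $R_{v,ii}$ are independent. Recall that $R_{v,ii}$ is the norm of the innovation of $\mathbf{v}_i$ with respect to the subspace spanned by $\mathbf{1}_u$, $\mathbf{Y}$ and $\mathbf{v}_1, \ldots, \mathbf{v}_{i-1}$, however because $\mathbf{v}_i$ is isotropically distributed, this power is independent of the specific subspace in question, and only depends on the dimensions of the subspace. Formally, consider the squared norm of the innovation of an $m \times 1$ vector of Gaussian (/complex Gaussian) i.i.d. random variables $\mathbf{v}$ with respect to a $k$ dimensional subspace spanned by the unitary matrix $\mathbf{U}_{m \times k}$, i.e. $\rho = \|\mathbf{v} - \mathbf{U}\mathbf{U}^*\mathbf{v}\|^2$. Completing $\mathbf{U}$ to an orthonormal basis $\bar{\mathbf{U}}_{m \times m}$, and defining $\mathbf{w} = \bar{\mathbf{U}}^*\cdot\mathbf{v}$, we have that $\mathbf{U}^*\mathbf{v} = \mathbf{w}_1^k$, and $\mathbf{U}\mathbf{U}^*\mathbf{v} = \mathbf{U}\mathbf{w}_1^k = \bar{\mathbf{U}}\begin{bmatrix}\mathbf{w}_1^k \\ 0_{(m-k) \times 1}\end{bmatrix}$. Therefore the innovation norm can be written as

$$\rho = \left\|\bar{\mathbf{U}}\mathbf{w} - \bar{\mathbf{U}}\begin{bmatrix}\mathbf{w}_1^k \\ 0_{(m-k) \times 1}\end{bmatrix}\right\|^2 = \left\|\bar{\mathbf{U}}\begin{bmatrix}0_{k \times 1} \\ \mathbf{w}_{k+1}^m\end{bmatrix}\right\|^2 = \left\|\mathbf{w}_{k+1}^m\right\|^2 \tag{296}$$

Since $\mathbf{w}$ has the same distribution as $\mathbf{v}$, the distribution of $\rho$ does not depend on $\mathbf{U}$. Furthermore $\rho\cdot d$ is distributed $\chi^2$ with $d\cdot(m-k)$ degrees of freedom (the multiplication with $d$ is needed in order to normalize the real and imaginary parts to unit power). Therefore the $R_{v,ii}$ are independent and $d\cdot R_{v,ii}^2$ is distributed $\chi^2$ with $d\cdot(m - i + 1 - r - u)$ degrees of freedom. Furthermore, $\|\mathbf{v}_i\|^2$ in (295) can be replaced by $\|\mathbf{w}_i\|^2$ (where $\mathbf{w}_i$ is the vector $\mathbf{v}_i$ rotated according to the same subspace), which are also independent. Therefore the expected value in (295) can be written as the product of expected values

$$\tilde{L}_m = E\left[\prod_{i=1}^{t} A_i\right] = \prod_{i=1}^{t} E[A_i] \tag{297}$$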

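It remains to bound this expected value. The $i$-th column of $\mathbf{V}$ that generates $R_{v,ii}$ is projected into a $k = (i-1) + r + u$ dimensional subspace ($i - 1$ previous columns of $\mathbf{V}$, $r$ columns of $\mathbf{Y}$ and an all-ones vector if $u = 1$). Below we take $\mathbf{w}$ to be the rotated version of $\mathbf{v}_i$:

$$\left(\frac{e}{m}\right)^{\frac{d}{2}\gamma m} E[A_i] = E\left[|R_{v,ii}|^{-\gamma d m}\, e^{\frac{d}{2}\gamma\|\mathbf{v}_i\|^2}\right] = E\left[\left\|\mathbf{w}_{i+r+u}^{m}\right\|^{-\gamma d m}\cdot e^{\frac{d}{2}\gamma\|\mathbf{w}\|^2}\right]$$
$$= \underset{S = d\|\mathbf{w}_{i+r+u}^{m}\|^2}{E}\left[\left(\frac{S}{d}\right)^{-\frac{\gamma d m}{2}} e^{\frac{\gamma}{2}S}\right]\cdot\underset{S = d\|\mathbf{w}_1^{i-1+r+u}\|^2}{E}\left[e^{\frac{\gamma}{2}S}\right] = d^{\frac{\gamma d m}{2}}\underset{S = d\|\mathbf{w}_{i+r+u}^{m}\|^2}{E}\left[S^{-\frac{\gamma d m}{2}} e^{\frac{\gamma}{2}S}\right]\cdot\underset{S = d\|\mathbf{w}_1^{i-1+r+u}\|^2}{E}\left[e^{\frac{\gamma}{2}S}\right] \tag{298}$$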



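For a general $\chi^2$ variable $S$ with $k$ degrees of freedom and a general exponent $a$:

$$E\left[S^{-a} e^{\frac{\gamma}{2}S}\right] = \int_{s=0}^{\infty} s^{-a} e^{\frac{\gamma}{2}s}\cdot\frac{1}{2^{k/2}\Gamma\left(\frac{k}{2}\right)} s^{\frac{k}{2}-1} e^{-\frac{s}{2}} ds = \frac{1}{2^{k/2}\Gamma\left(\frac{k}{2}\right)}\int_{s=0}^{\infty} s^{\frac{k}{2}-1-a} e^{-\frac{1}{2}(1-\gamma)s} ds$$
$$\overset{h = \frac{1}{2}(1-\gamma)s}{=} \frac{\left(\frac{1}{2}(1-\gamma)\right)^{-\left(\frac{k}{2}-a\right)}}{2^{k/2}\Gamma\left(\frac{k}{2}\right)}\int_{h=0}^{\infty} h^{\frac{k}{2}-a-1} e^{-h} dh \overset{(*)}{=} 2^{-a}\cdot(1-\gamma)^{-\left(\frac{k}{2}-a\right)}\cdot\frac{\Gamma\left(\frac{k}{2}-a\right)}{\Gamma\left(\frac{k}{2}\right)} \tag{299}$$

where $(*)$ is by definition $\Gamma(z) = \int_{h=0}^{\infty} h^{z-1}\cdot e^{-h} dh$, and in order for the integral to exist (near $h = 0$) we need to assume $\frac{k}{2} - 1 - a > -1$, i.e. $a < \frac{k}{2}$.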
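The closed form can be sanity-checked against direct numerical integration. The sketch below does so for the expression $E[S^{-a} e^{\gamma S/2}] = 2^{-a}(1-\gamma)^{-(k/2-a)}\,\Gamma(\frac{k}{2}-a)/\Gamma(\frac{k}{2})$ with $S$ a $\chi^2$ variable with $k$ degrees of freedom; the parameter values, the integration range and the step count are arbitrary choices made for this example.

```python
import math

def closed_form(k, a, gamma):
    """2^{-a} (1-gamma)^{-(k/2-a)} Gamma(k/2-a) / Gamma(k/2); needs a < k/2, gamma < 1."""
    log_ratio = math.lgamma(k / 2.0 - a) - math.lgamma(k / 2.0)
    return 2.0 ** (-a) * (1.0 - gamma) ** (-(k / 2.0 - a)) * math.exp(log_ratio)

def numerical(k, a, gamma, upper=200.0, steps=200000):
    """Midpoint-rule integral of s^{-a} e^{gamma s/2} times the chi-square(k) density."""
    h = upper / steps
    log_norm = -(k / 2.0) * math.log(2.0) - math.lgamma(k / 2.0)
    total = 0.0
    for i in range(steps):
        s = (i + 0.5) * h
        exponent = (k / 2.0 - 1.0 - a) * math.log(s) - (1.0 - gamma) * s / 2.0
        total += math.exp(log_norm + exponent)
    return total * h

exact = closed_form(10, 1.5, 0.4)
approx = numerical(10, 1.5, 0.4)
```

The two numbers agree to several digits; the condition $a < \frac{k}{2}$ is what keeps the integrand integrable near zero.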

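Substituting this in (298) ($a = \frac{\gamma}{2}dm$, $k = d(m - i + 1 - r - u)$ for the first expression and $a = 0$, $k = d\cdot(i - 1 + r + u)$ for the other) we have:

$$E[A_i] = \left(\frac{e}{m}\right)^{-\frac{d}{2}\gamma m} d^{\frac{\gamma d m}{2}}\cdot 2^{-\frac{\gamma d m}{2}}(1-\gamma)^{-\left(\frac{d(m-i+1-r-u)}{2} - \frac{\gamma d m}{2}\right)}\cdot\frac{\Gamma\left(\frac{d(m-i+1-r-u)}{2} - \frac{\gamma d m}{2}\right)}{\Gamma\left(\frac{d(m-i+1-r-u)}{2}\right)}\cdot(1-\gamma)^{-\frac{d(i-1+r+u)}{2}}$$
$$= \left(\frac{dm}{2e}\right)^{\frac{d}{2}\gamma m}(1-\gamma)^{-\frac{dm}{2}(1-\gamma)}\cdot\frac{\Gamma\left(\frac{d((1-\gamma)m - (i-1+r+u))}{2}\right)}{\Gamma\left(\frac{d(m-i+1-r-u)}{2}\right)} \tag{300}$$

where to meet the condition $a < \frac{k}{2}$ we need to require $\frac{\gamma}{2}dm < \frac{1}{2}d(m - i + 1 - r - u) \Rightarrow \gamma < 1 - \frac{i-1+r+u}{m}$. Since this must hold for any $i = 1, \ldots, t$, this implies $\gamma < 1 - \frac{t-1+r+u}{m}$. Recall that in order to have a decreasing redundancy in Theorem 7 (see for example Corollary 7.1) we need to have $\frac{1}{m}\log L_m \to 0$, which implies in our case $\frac{1}{m}\log E[A_i] \to 0$. This is not immediately clear from (300). We use Stirling's approximation for the Gamma function:

$$\Gamma(z) = \sqrt{2\pi}\, z^{z-\frac{1}{2}} e^{-z + \frac{\eta}{12z}}, \qquad 0 < \eta < 1 \tag{301}$$

For brevity we define $z_1 = \frac{dm}{2}$, $z_2 = (1-\gamma)z_1$, $z_3 = \frac{d(i-1+r+u)}{2} = z_1\cdot\frac{i-1+r+u}{m}$. Under our assumptions, $z_1 > z_2 > z_3$.

We further assume $z_2 - z_3 \ge 1$.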
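As a quick numerical check of the Stirling form $\Gamma(z) = \sqrt{2\pi}\, z^{z-1/2} e^{-z + \eta/(12z)}$ with $0 < \eta < 1$, the sketch below solves for $\eta$ at a few values of $z$ using the log-gamma function from the standard library. The helper name and the sample points are hypothetical.

```python
import math

def stirling_eta(z):
    """Solve Gamma(z) = sqrt(2 pi) z^(z - 1/2) exp(-z + eta / (12 z)) for eta."""
    log_main = 0.5 * math.log(2.0 * math.pi) + (z - 0.5) * math.log(z) - z
    return 12.0 * z * (math.lgamma(z) - log_main)

# eta for a few sample arguments, including z = 1 where Gamma(1) = 1.
etas = [stirling_eta(z) for z in (0.5, 1.0, 2.5, 10.0, 100.0)]
```

All the values of $\eta$ lie strictly between 0 and 1 and approach 1 as $z$ grows, which is why the correction term can be bounded by $e^{1/12}$ once the argument is at least 1.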

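$$\frac{\Gamma\left(\frac{d((1-\gamma)m - (i-1+r+u))}{2}\right)}{\Gamma\left(\frac{d(m-i+1-r-u)}{2}\right)} = \frac{\Gamma(z_2 - z_3)}{\Gamma(z_1 - z_3)} \le \frac{\sqrt{2\pi}(z_2 - z_3)^{(z_2 - z_3) - \frac{1}{2}} e^{-(z_2 - z_3) + \frac{1}{12(z_2 - z_3)}}}{\sqrt{2\pi}(z_1 - z_3)^{(z_1 - z_3) - \frac{1}{2}} e^{-(z_1 - z_3)}}$$
$$\le \left(\frac{z_2 - z_3}{z_1 - z_3}\right)^{(z_2 - z_3) - \frac{1}{2}}\cdot(z_1 - z_3)^{z_2 - z_1}\, e^{z_1 - z_2}\, e^{\frac{1}{12}} \overset{(a)}{\le} \left(\frac{z_2}{z_1}\right)^{(z_2 - z_3) - \frac{1}{2}}\cdot(z_1 - z_3)^{z_2 - z_1}\, e^{z_1 - z_2}\, e^{\frac{1}{12}}$$
$$= (1-\gamma)^{\frac{dm}{2}(1-\gamma) - \frac{d(i-1+r+u)}{2} - \frac{1}{2}}\cdot\left(\frac{dm}{2}\left(1 - \frac{i-1+r+u}{m}\right)\right)^{-\frac{d}{2}\gamma m} e^{\frac{d}{2}\gamma m}\, e^{\frac{1}{12}}$$
$$\le \left(\frac{dm}{2e}\right)^{-\frac{d}{2}\gamma m}\cdot(1-\gamma)^{\frac{dm}{2}(1-\gamma)}\cdot\left(1 - \frac{i-1+r+u}{m}\right)^{-\frac{d}{2}\gamma m}\cdot(1-\gamma)^{-\frac{d(i+r+u)}{2}}\cdot e^{\frac{1}{12}} \tag{302}$$

where in inequality (a) we used $\frac{z_2 - z_3}{z_1 - z_3} \le \frac{z_2}{z_1}$ (which stems from $z_1 > z_2$), and under the assumption $z_2 - z_3 \ge 1$ the exponent $(z_2 - z_3) - \frac{1}{2}$ is positive. In the last step we used $\frac{1}{2} \le \frac{d}{2}$. The condition $z_2 - z_3 \ge 1$ implies $(1-\gamma)m \ge \frac{2}{d} + i - 1 + r + u$, so it is sufficient that $(1-\gamma)m \ge i + 1 + r + u$. Note that the two first terms cancel out respective terms in (300) and the last two terms are independent of $m$. The term $\left(1 - \frac{i-1+r+u}{m}\right)^{-\frac{d}{2}\gamma m}$ tends to $e^{(i-1+r+u)\gamma\frac{d}{2}}$ as $m \to \infty$. For finite $m$, using $\ln(1+x) \ge \frac{x}{1+x}$ we have:

$$\ln\left[\left(1 - \frac{i-1+r+u}{m}\right)^{-\frac{d}{2}\gamma m}\right] \le -\frac{d}{2}\gamma m\cdot\frac{-\frac{i-1+r+u}{m}}{1 - \frac{i-1+r+u}{m}} = \frac{d}{2}\cdot\frac{\gamma(i-1+r+u)}{1 - \frac{i-1+r+u}{m}} \le \frac{d}{2}(i-1+r+u) \tag{303}$$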



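Substituting (302) and (303) in (300),

$$E[A_i] \le e^{\frac{d}{2}(i-1+r+u)}\cdot(1-\gamma)^{-\frac{d(i+r+u)}{2}}\, e^{\frac{1}{12}} \tag{304}$$

where we have assumed $(1-\gamma)m \ge i + 1 + r + u$. Substituting into (297) we obtain:

$$\tilde{L}_m = \prod_{i=1}^{t} E[A_i] \le \prod_{i=1}^{t} e^{\frac{d}{2}(i-1+r+u)}\cdot(1-\gamma)^{-\frac{d(i+r+u)}{2}}\, e^{\frac{1}{12}} = e^{\frac{d}{2}\left(\frac{1}{2}(t-1)+r+u\right)t}\cdot(1-\gamma)^{-\frac{d}{2}\left(\frac{1}{2}(t+1)+r+u\right)t}\, e^{\frac{t}{12}}$$
$$\le e^{\frac{d}{4}(t+1+2r+2u)t}\,(1-\gamma)^{-\frac{d}{4}(t+1+2r+2u)t} = \left(\frac{e}{1-\gamma}\right)^{\frac{d}{4}(t+1+2r+2u)\cdot t}$$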



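Note that we obtained a constant bound on $\tilde{L}_m$ that does not grow with $m$. Returning to (288) (recall that $m = k - j$, $\mathbf{X} = \mathbf{X}_{j+1}^k$, $\mathbf{Y} = \mathbf{Y}_{j+1}^k$):

$$E\left[\psi(\mathbf{X}^k, \mathbf{Y}^k, j) \,\middle|\, \mathbf{X}^j\right] = \underset{Q}{E}\left[\psi(\mathbf{X}, \mathbf{Y}, 0)\right] \le \frac{1}{1-\delta_\Omega}\underset{\tilde{Q}}{E}\left[\psi(\mathbf{X}, \mathbf{Y}, 0)\right] = \frac{1}{1-\delta_\Omega}\tilde{L}_m \tag{305}$$

$$\le \frac{1}{1-\delta_\Omega}\left(\frac{e}{1-\gamma}\right)^{\frac{d}{4}(t+1+2r+2u)\cdot t} \triangleq L_m \tag{306}$$

(306) defines $L_m$ under which the CCDF condition of Theorem 7 holds, and $L_m$ is non-decreasing as required. To satisfy the assumption $(1-\gamma)m \ge i + 1 + r + u$ for all $i \le t$, we define $b_0 = \frac{t+1+r+u}{1-\gamma}$ as the minimal symbol for which the bound holds (see the definitions of Theorem 7).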

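The CCDF condition directly yields the result of Lemma 9: from the CCDF condition we have that the intrinsic redundancy of $\gamma R_{emp}^{ML}$ satisfies:

$$\mu_Q\left(\gamma R_{emp}^{ML}\right) \le \frac{1}{n}\log\underset{Q}{E}\left[\exp\left(n\gamma R_{emp}^{ML}(\mathbf{X}, \mathbf{Y})\right)\right] = \frac{1}{n}\log\underset{Q}{E}\left[\psi(\mathbf{X}^n, \mathbf{Y}^n, 0)\right] \le \frac{1}{n}\log L_n$$
$$= \frac{1}{n}\log\left[\frac{1}{1-\delta_\Omega}\left(\frac{e}{1-\gamma}\right)^{\frac{d}{4}(t+1+2r+2u)\cdot t}\right] = \frac{1}{n}\log\frac{1}{1-\delta_\Omega} + \frac{1}{n}\cdot\frac{d}{4}(t+1+2r+2u)\cdot t\cdot\log\frac{e}{1-\gamma} \tag{307}$$

The condition on $\gamma$ is obtained by the requirement to satisfy the conditions of (306) for $m = n$.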

3) Proof of Theorem 13: In this section we wrap up the proof of Theorem 13 by combining the results together. From (306) we have that the CCDF condition holds with $L_m = \frac{1}{1-\delta_\Omega}\left(\frac{e}{1-\gamma}\right)^{\frac{d}{4}(t+1+2r+2u)\cdot t}$ and $b_0 = \frac{t+1+r+u}{1-\gamma}$. Substituting this and the summability condition with $f_0$ defined in (266) in Theorem 7 we have that the following rate function is adaptively achievable:





1 



iog(V^S 



K 

n 



(308) 



with c„ — log ^T-^ and bi = bo + 2dpB — 1- We have 



- ■ log(V'o ) ^^ 7 



7^Q(X)-i/ML(X|Y) 



■j\og{l~Sn) < jR. 



Hq{X) - iJML(X|Y) +7log(l - 5n) 



ML 

cmp 



Where Rf^p is defined in ( |243] l. 
Substituting we obtain: 



, . (266). (309) W 

/,i"'(^o") ^ ^(^ + ^')7 • log(e) + 7i?, 



ML 

cnip 



, n- Ln , 
c„ = log — = log 



An) 



Cn+b^-f^'"ii'^)<l0g 



n 



rfFBe(l - Sn) 4 



+ - (t + 1 + 2r + 2u) • i • log 



d I e 
-: p^ + - (t + 1 + 2r + 2u) • t • log 



1-7. 
t+l+r+u 
1^7 



2dp„ - 1 



(309) 

(310) 
(311) 



-(i + 172)7.1og(e)+7i?, 



ML 

cmp 



>frii'S) 

= log n + fli + a2 log 



03 



1 — 7 \1 — 7 



04-7 [R. 



cmp 



as) — An.'y • R, 



7 ''cmp 



^n.") 



(312) 
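where

$$a_0 = \log\frac{1}{1-\delta_n}, \quad a_1 = a_0 + \log\frac{1}{d_{FB}\,\epsilon} + a_2\log(e), \quad a_2 = \frac{d}{4}(t+1+2r+2u)\cdot t, \quad a_3 = t+1+r+u, \quad a_4 = 2d_{FB} - 1, \quad a_5 = \frac{d}{2}\left(t + \Omega^2\right)\log(e) \qquad (313)$$

$$A_{n,\gamma} = \gamma\left(\frac{a_3}{1-\gamma} + a_4\right), \quad B_{n,\gamma} = \log n + a_1 + a_2\log\frac{1}{1-\gamma} + \left(\frac{a_3}{1-\gamma} + a_4\right)\cdot\gamma\cdot a_5 \qquad (314)$$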



We may lower bound the achievable rate $R_{emp}$ of (308) by:



Rr 



> 



1 



4 . /? 



cmp 



B 



n,7 



4 . ?? 

-^n,7 -'I-, 



■cinp 



K + B.^ 



L /^""omp 
1 

■1-R. 



--l\og{l~ 6n)\ 



ML 

cnip 



ao 



K 

n 



K 

n 

v-R. 



(315) 
(316) 
(317) 

(318) 
(319) 

(320) 
(321) 



(322) 



einp 



"'cinp 



where 






S = 



K 

ao - 



n,7 
Bn. 

K 

n 



(323) 
(324) 
(325) 



This shows the main results of the theorem. 

In order to show asymptotic achievability we need to show there exists a choice of 7, J7 and K as functions of n such that 

^ ^ 0, 7 — > 1 and ao — > while — SA , —jf^ — y q. Examining these 

n— foo n— foo n— >-oo 

= 1.44 and results from h\{x)/x < e~^ 



rj — > l,a,S — > 0. This requires that 



expression we observe it is sufficient that 
4) Proof of Lemma 



10 



,._ .„ — > 0. A possible choice \s K = 

\^ 7J/I ^^_^QQ 

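Using $\log(x) \le x$ (this is true for a logarithm of base larger than $e^{1/e} = 1.44$ and results from $\ln(x)/x \le e^{-1}$, which can be proven by differentiation) and assuming $R_{emp}^{ML} \le R_0$, we may coarsely bound $A_{n,\gamma} \cdot R_{emp}^{ML} + B_{n,\gamma}$ in (322) by:

$$A_{n,\gamma} \cdot R_{emp}^{ML} + B_{n,\gamma} \le \log n + a_1 + a_2\cdot\frac{1}{1-\gamma} + \frac{a_3 + a_4}{1-\gamma}\left(R_{emp}^{ML} + a_5\right) \le \frac{\overbrace{\log n + a_1 + a_2 + (a_3 + a_4)(R_0 + a_5)}^{a_6}}{1-\gamma} = \frac{a_6}{1-\gamma} \qquad (326)$$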



Using ^ 



> 1 



> 1 



and ( |322| i we write (for i?, 



ML ^ D \. 

omp S -f^o^ 



(l-7)X 



■1-R. 



ML 

cmp 



K 



ao 



— -'''cinp 



(1 - 7) • i?0 + 



ae 



l-7)i^ 



i?o 






'flo 



(327) 



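We now choose $\gamma, K$ that minimize $\delta_0$. To minimize $(1-\gamma)\cdot R_0 + \frac{a_6}{(1-\gamma)K}\cdot R_0$ we choose $(1-\gamma) = \sqrt{\frac{a_6}{K}}$ (see Lemma 6) and obtain $(1-\gamma)\cdot R_0 + \frac{a_6}{(1-\gamma)K}\cdot R_0 = 2\sqrt{\frac{a_6}{K}}\cdot R_0$. Following, $K$ is chosen to minimize $2\sqrt{\frac{a_6}{K}}\cdot R_0 + \frac{K}{n}$, which yields $K = (n\cdot\sqrt{a_6}\cdot R_0)^{2/3}$. This value is rounded up to an integer value, incurring an additional loss of at most $\frac{1}{n}$. Substituting we have $2\sqrt{\frac{a_6}{K}}\cdot R_0 + \frac{K}{n} = 3n^{-1/3}a_6^{1/3}R_0^{2/3}$. Accounting for the additional loss of $\frac{1}{n}$ due to rounding $K$, we have $\delta_0 \le 3n^{-1/3}a_6^{1/3}R_0^{2/3} + \frac{1}{n}$.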
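The choice of $K$ and $\gamma$ can be checked numerically. The following is a minimal sketch, assuming the quantity being minimized is $(1-\gamma)R_0 + \frac{a_6 R_0}{(1-\gamma)K} + \frac{K}{n}$; the function names `lemma10_choice` and `loss` are illustrative and do not come from the paper, and the inputs $n = 10^5$, $a_6 = 356$, $R_0 = 5$ are the values listed in Table II.

```python
import math


def lemma10_choice(n, a6, R0):
    """Closed-form choice of K and gamma (assumed loss stated in the lead-in)."""
    # K = (n * sqrt(a6) * R0)^(2/3), rounded up to an integer
    K = math.ceil((n * math.sqrt(a6) * R0) ** (2.0 / 3.0))
    # (1 - gamma) = sqrt(a6 / K)
    gamma = 1.0 - math.sqrt(a6 / K)
    # Bound on the loss, including 1/n for the rounding of K
    bound = 3.0 * n ** (-1.0 / 3.0) * a6 ** (1.0 / 3.0) * R0 ** (2.0 / 3.0) + 1.0 / n
    return K, gamma, bound


def loss(n, a6, R0, K, gamma):
    """The three loss terms that the choice of gamma and K trades off."""
    return (1.0 - gamma) * R0 + a6 * R0 / ((1.0 - gamma) * K) + K / n


K, gamma, bound = lemma10_choice(n=1e5, a6=356.0, R0=5.0)
print(K, round(gamma, 3), round(bound, 2), round(loss(1e5, 356.0, 5.0, K, gamma), 2))
```

Perturbing either `K` or `gamma` away from the returned values should only increase `loss`, which is a quick sanity check on the two-step minimization.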
TABLE II

Parameters of the rate adaptive scheme for MIMO (Section VIII-F), for Figure 9

Parameters of the scheme used for Figure 9

Basic parameters: $n = 1e+005$, $t = 2$, $r = 2$, $d = 2$, $u = 1$, $\epsilon = 0.001$, $\Omega = 5$, $d_{FB} = 1$

Parameters of Lemma 10: $R_0 = 5$, $a_6 = 356$ => $K = 4.5e+004$, $\gamma = 0.911$

Intermediate parameters of Theorem 13: $a_0 = 0$, $a_1 = 23$, $a_2 = 9$, $a_3 = 6$, $a_4 = 1$, $a_5 = 39$, $A_{n,\gamma} = 62$, $B_{n,\gamma} = 2.5e+003$

Final parameters of Theorem 13: $\delta = 0.45$, $\alpha = 0.0013$, $\eta = 0.863$, $f_0 = 3.17e-019$

Final parameters of Lemma 10: $\delta_0 = 1.3$

Saturation (limit) of lower bound for $R_{emp}$: 654.56
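The intermediate parameters in the table can be recomputed from the basic ones. The sketch below assumes the coefficient definitions read as $a_2 = \frac{d}{4}(t+1+2r+2u)t$, $a_1 = a_0 + \log\frac{1}{d_{FB}\epsilon} + a_2\log e$, $a_3 = t+1+r+u$, $a_4 = 2d_{FB}-1$, $A_{n,\gamma} = \gamma(\frac{a_3}{1-\gamma}+a_4)$, $B_{n,\gamma} = \log n + a_1 + a_2\log\frac{1}{1-\gamma} + A_{n,\gamma}a_5$ and $a_6 = \log n + a_1 + a_2 + (a_3+a_4)(R_0+a_5)$, with base-2 logarithms and with $r = 2$, $d = 2$. The values $a_0 = 0$ and $a_5 = 39$ are taken from the table as inputs rather than derived, and the function name is hypothetical.

```python
import math


def theorem13_coefficients(n, t, r, u, d, eps, d_fb, gamma, R0, a0, a5):
    """Recompute the table's intermediate parameters (assumed definitions, base-2 logs)."""
    log2e = math.log2(math.e)
    a2 = d / 4.0 * (t + 1 + 2 * r + 2 * u) * t
    a1 = a0 + math.log2(1.0 / (d_fb * eps)) + a2 * log2e
    a3 = t + 1 + r + u
    a4 = 2 * d_fb - 1
    A = gamma * (a3 / (1.0 - gamma) + a4)
    B = math.log2(n) + a1 + a2 * math.log2(1.0 / (1.0 - gamma)) + A * a5
    a6 = math.log2(n) + a1 + a2 + (a3 + a4) * (R0 + a5)
    return {"a1": a1, "a2": a2, "a3": a3, "a4": a4, "A": A, "B": B, "a6": a6}


coeffs = theorem13_coefficients(
    n=1e5, t=2, r=2, u=1, d=2, eps=0.001, d_fb=1,
    gamma=0.911, R0=5.0, a0=0.0, a5=39.0,
)
for name, value in coeffs.items():
    print(name, round(value, 2))
```

Under these assumptions the computed values agree with the rounded table entries, which is a useful consistency check on the reading of the basic parameters.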



5) The intrinsic redundancy: In an earlier example we claimed that the SISO version of the rate function $R_{emp} = \frac{1}{2}\log\frac{1}{1-\hat\rho^2}$ has an intrinsic redundancy $\mu_Q(R_{emp}) = \infty$. This implies of course that the MIMO rate function also has an infinite intrinsic redundancy (since the SISO rate function can be attained as a particular case by zeroing some of the inputs and outputs). This results from the fact that $\Pr(R_{emp} > R) \approx \exp(-(n-1)R)$ (instead of $\exp(-nR)$ as required to satisfy the necessary or sufficient condition of Theorem 1). This exponent is already implied by Lemma 4 in the previous paper [1], but Lemma 4 is an upper bound, and to prove that $\mu_Q(R_{emp}) = \infty$ a lower bound on the probability $\Pr(R_{emp} > R)$ is required. Below we prove this claim using such a lower bound.

We use the same technique and notation as the proof of Lemma 4 in the previous paper [1]. There we showed that

$$\Pr(|\hat\rho| > t) = \Pr\left(X_1^2 > \frac{t^2}{1-t^2}\cdot\|X_2^n\|^2\right) \qquad (328)$$

where $X$ is a Gaussian normal vector of length $n$, $X \sim \mathcal{N}^n(0,1)$. $\|X_2^n\|^2$ is distributed Chi-square with $k = n-1$ degrees of freedom. For a random variable $V \sim \chi_k^2$ (Chi-square with $k$ degrees of freedom), one has:


""'' ^ "> = £0 ¥7mm '""''"'^'" - -<^)/l'""'=-'^<« 



ci(fe) 



Cl(fc) 

k/2 

C2(k) 



,k/2-v/2 



(329) 



In our case: 

Pr(|p| >t)^E 



Pr MIX 



"l|2 



< 



1-t' 



C2(n- 1) 



i2 



-xl 



Xl 



>E 



-halfi 



C2(n- 1) 



\~i' 



t^ 



-Xl 



-halfi 



-Xt 



\2tt) ~e 2^1^X1= C3(n) 

=C2in~l)(2ir)- 



~1^ 



-half{^ 



=/* t{l - <2)T^ C3(n) / ^^-ig-'^'^V^" . dz = C4{n)t{l - f2)^ 



C4(n) 



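(330)

Therefore

$$\Pr(R_{emp} > R) = \Pr\left\{|\hat\rho| > \sqrt{1-\exp(-2R)}\right\} \ge c_4(n)\sqrt{1-\exp(-2R)}\,\exp(-(n-1)R) \qquad (331)$$

and

$$\mu_Q(R_{emp}) \ge \sup_R\left\{\frac{1}{n}\log\Pr\{R_{emp} > R\} + R\right\} \ge \sup_R\left\{\frac{1}{n}\log c_4(n) + \frac{1}{n}\log\sqrt{1-\exp(-2R)} - \frac{n-1}{n}R + R\right\} \ge \frac{1}{n}\log c_4(n) + \frac{1}{n}\lim_{R\to\infty}\left[\log\sqrt{1-\exp(-2R)} + R\right] = \infty \qquad (332)$$

The limit diverges because $\lim_{R\to\infty}\log\sqrt{1-\exp(-2R)} = \log 1 = 0$. □

The geometric interpretation of Lemma 4 given in the appendix of the paper [1] may also be used to prove the same claim.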
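The divergence can be illustrated numerically. The sketch below is not the derivation of the text: it assumes natural-log units (rates in nats) and uses the standard fact that, for independent Gaussian vectors, $1-\hat\rho^2$ is Beta distributed with parameters $\frac{n-1}{2}$ and $\frac{1}{2}$, which gives an explicit lower bound on the tail with the same exponent $\exp(-(n-1)R)$ in place of the constant $c_4(n)$. All function names are illustrative.

```python
import math
import random


def log_tail_lower_bound(n, R):
    """ln of a lower bound on Pr(R_emp > R), R in nats, n >= 2.

    Uses u = 1 - rho^2 ~ Beta(a, 1/2) with a = (n-1)/2 and
    Pr(u < eps) >= c * eps^a / a, where c = 1 / B(a, 1/2) and eps = exp(-2R).
    """
    a = (n - 1) / 2.0
    log_c = math.lgamma(a + 0.5) - math.lgamma(a) - 0.5 * math.log(math.pi)
    return log_c - math.log(a) - (n - 1) * R


def redundancy_term(n, R):
    """Lower bound on (1/n) ln Pr(R_emp > R) + R; grows like R / n."""
    return log_tail_lower_bound(n, R) / n + R


def empirical_tail(n, R, trials, seed=0):
    """Monte Carlo estimate of Pr(R_emp > R) for independent Gaussian x, y."""
    rng = random.Random(seed)
    threshold = math.exp(-2.0 * R)
    hits = 0
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]
        dot = sum(a * b for a, b in zip(x, y))
        rho2 = dot * dot / (sum(a * a for a in x) * sum(b * b for b in y))
        if 1.0 - rho2 < threshold:  # equivalent to R_emp > R
            hits += 1
    return hits / trials


for R in (1.0, 10.0, 100.0):
    print(R, redundancy_term(10, R))
```

The printed term increases linearly in $R$ with slope $\frac{1}{n}$, so its supremum over $R$ is unbounded for any fixed $n$.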



F. The conditional Lempel-Ziv and probability assignments implemented by FSM-s

Below we prove the claim from Section VIII-E4 that the probability $P_{LZ}(\mathbf{x}|\mathbf{y}) = \exp(-L(\mathbf{x}|\mathbf{y}))$ assigned by the conditional LZ to an input sequence asymptotically surpasses (up to vanishing factors) the probability that can be assigned to the sequence by any finite state machine operating on the sequences $\mathbf{x}, \mathbf{y}$. For simplicity of notation we will use $\mathbf{x}, \mathbf{y}$ to denote phrases, and the full sequences will be denoted $\mathbf{x}^n, \mathbf{y}^n$. Although this claim is straightforward and similar claims appear in [16], [29], we did not find the exact claim, and therefore we prove it below.



The state machine has $S$ states. At each symbol it receives $y_i, x_i$, assigns a probability for $x_i$ and moves to a next state based on $y_i, x_i$. The total probability is the product of the (conditional) probabilities assigned to the letters. It is required of course that the sum of the probabilities assigned to different $x_i$-s (and as a consequence to different sequences $\mathbf{x}$) will be 1.

Let $(\mathbf{x}_l, \mathbf{y}_l)$ denote the $l$-th phrase out of $c$ phrases in the joint parsing of $\mathbf{x}, \mathbf{y}$. Suppose the state of the state machine at the beginning of this phrase is $s_l$. The cumulative probability assigned by the state machine to the phrase can be written as a function of $\mathbf{x}_l, \mathbf{y}_l, s_l$. Denote the probability assigned to a phrase $\mathbf{x}$ given the phrase $\mathbf{y}$ with the initial state $s$ as $P(\mathbf{x}|\mathbf{y}, s)$ (this function characterizes the state machine, and must satisfy $\sum_{\mathbf{x}} P(\mathbf{x}|\mathbf{y}, s) = 1$). Then the overall probability assigned by the state machine is:

$$P(\mathbf{x}^n|\mathbf{y}^n) = \prod_{l=1}^{c} P(\mathbf{x}_l|\mathbf{y}_l, s_l) \qquad (333)$$

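Let $c_l(\mathbf{x}|\mathbf{y})$ count the number of different $\mathbf{x}_l$ that appear jointly with $\mathbf{y}_l$, and $c_l(\mathbf{x}|\mathbf{y}, s)$ the number of different $\mathbf{x}_l$ that appear jointly with $\mathbf{y}_l$ with $s_l = s$ (i.e. $c_l(\mathbf{x}|\mathbf{y}, s) = \sum_{l: \mathbf{y}_l = \mathbf{y}, s_l = s} 1$). Then, looking at the part of the product above associated with specific $\mathbf{y}_l$ and $s_l$, we have by Jensen's inequality:

$$\log\prod_{l: \mathbf{y}_l = \mathbf{y}, s_l = s} P(\mathbf{x}_l|\mathbf{y}_l, s_l) = c_l(\mathbf{x}|\mathbf{y}, s)\cdot\frac{1}{c_l(\mathbf{x}|\mathbf{y}, s)}\sum_{l: \mathbf{y}_l = \mathbf{y}, s_l = s}\log P(\mathbf{x}_l|\mathbf{y}_l, s_l) \le c_l(\mathbf{x}|\mathbf{y}, s)\cdot\log\left(\frac{1}{c_l(\mathbf{x}|\mathbf{y}, s)}\sum_{l: \mathbf{y}_l = \mathbf{y}, s_l = s} P(\mathbf{x}_l|\mathbf{y}_l, s_l)\right) \le c_l(\mathbf{x}|\mathbf{y}, s)\cdot\log\frac{1}{c_l(\mathbf{x}|\mathbf{y}, s)} \qquad (334)$$

where $\sum_{l: \mathbf{y}_l = \mathbf{y}, s_l = s} P(\mathbf{x}_l|\mathbf{y}_l, s_l) \le 1$ since no phrase $\mathbf{x}_l$ can appear twice. Hence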

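$$\log P(\mathbf{x}^n|\mathbf{y}^n) = \log\prod_{\mathbf{y}, s}\;\prod_{l: \mathbf{y}_l = \mathbf{y}, s_l = s} P(\mathbf{x}_l|\mathbf{y}_l, s_l) \le \sum_{\mathbf{y}, s} c_l(\mathbf{x}|\mathbf{y}, s)\cdot\log\frac{1}{c_l(\mathbf{x}|\mathbf{y}, s)}$$

$$= \sum_{\mathbf{y}} c_l(\mathbf{x}|\mathbf{y})\underbrace{\sum_{s}\frac{c_l(\mathbf{x}|\mathbf{y}, s)}{c_l(\mathbf{x}|\mathbf{y})}\log\frac{c_l(\mathbf{x}|\mathbf{y})}{c_l(\mathbf{x}|\mathbf{y}, s)}}_{\le \log S} - \sum_{\mathbf{y}} c_l(\mathbf{x}|\mathbf{y})\cdot\log c_l(\mathbf{x}|\mathbf{y})$$

$$\stackrel{(a)}{\le} \sum_{\mathbf{y}} c_l(\mathbf{x}|\mathbf{y})\cdot\log S - \sum_{\mathbf{y}} c_l(\mathbf{x}|\mathbf{y})\cdot\log c_l(\mathbf{x}|\mathbf{y}) = c\cdot\log S - \sum_{\mathbf{y}} c_l(\mathbf{x}|\mathbf{y})\cdot\log c_l(\mathbf{x}|\mathbf{y}) \qquad (335)$$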

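where (a) is because the braced expression can be interpreted as the entropy of the probability over $s$, $p(s) = \frac{c_l(\mathbf{x}|\mathbf{y}, s)}{c_l(\mathbf{x}|\mathbf{y})}$, and is therefore bounded by the entropy of a uniform distribution over $s = 1, \ldots, S$. The value $\sum_{\mathbf{y}} c_l(\mathbf{x}|\mathbf{y})\cdot\log c_l(\mathbf{x}|\mathbf{y})$ is the conditional LZ complexity. Therefore we have that for any conditional probability $P(\mathbf{x}^n|\mathbf{y}^n)$ implemented by a finite state machine with no more than $S$ states, one has:

$$\log P(\mathbf{x}^n|\mathbf{y}^n) \le c\cdot\log S - C_{LZ}(\mathbf{x}|\mathbf{y}) \qquad (336)$$

where

$$C_{LZ}(\mathbf{x}|\mathbf{y}) = \sum_{\mathbf{y}} c_l(\mathbf{x}|\mathbf{y})\cdot\log c_l(\mathbf{x}|\mathbf{y}) = \sum_{l=1}^{c}\log c_l(\mathbf{x}|\mathbf{y}) \qquad (337)$$

is the conditional LZ complexity, $c_l(\mathbf{x}|\mathbf{y})$ is defined above, and $c$ is the number of phrases in the joint parsing of $\mathbf{x}, \mathbf{y}$. The number of phrases $c$ is bounded by a term of order $\frac{n}{\log n}$ [26, Eq.(6)]. Therefore, when considering $\frac{1}{n}\log P(\mathbf{x}^n|\mathbf{y}^n)$, the first term in the RHS of (336) yields an asymptotically vanishing factor $\frac{c\cdot\log S}{n} \to 0$.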
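The bound (336) can be checked on a small example. The following is a minimal sketch, assuming binary alphabets, an incremental (LZ78-style) joint parsing of the pair sequence in which a trailing incomplete phrase is dropped, and a randomly drawn finite state machine; the function names and the way the machine is generated are illustrative assumptions, not part of the paper.

```python
import math
import random
from collections import defaultdict


def joint_lz_parse(x, y):
    """Parse the pair sequence into distinct phrases; drop a trailing incomplete phrase."""
    seen, phrases, current = set(), [], []
    for pair in zip(x, y):
        current.append(pair)
        key = tuple(current)
        if key not in seen:  # shortest phrase not seen before
            seen.add(key)
            phrases.append(key)
            current = []
    return phrases


def conditional_lz_complexity(phrases):
    """C_LZ(x|y) = sum over distinct y-phrases of c(x|y) * log2 c(x|y)."""
    xs_by_y = defaultdict(set)
    for phrase in phrases:
        x_part = tuple(a for a, _ in phrase)
        y_part = tuple(b for _, b in phrase)
        xs_by_y[y_part].add(x_part)
    return sum(len(xs) * math.log2(len(xs)) for xs in xs_by_y.values())


def random_fsm_log2_prob(phrases, num_states, seed=0):
    """log2-probability a randomly drawn FSM assigns to the x letters of the phrases."""
    rng = random.Random(seed)
    # p_one[s][y]: probability of x = 1 in state s when the output letter is y
    p_one = [[rng.uniform(0.05, 0.95) for _ in range(2)] for _ in range(num_states)]
    # next_state[s][y][x]: deterministic state transition
    next_state = [[[rng.randrange(num_states) for _ in range(2)] for _ in range(2)]
                  for _ in range(num_states)]
    state, total = 0, 0.0
    for phrase in phrases:
        for xi, yi in phrase:
            p = p_one[state][yi] if xi == 1 else 1.0 - p_one[state][yi]
            total += math.log2(p)
            state = next_state[state][yi][xi]
    return total


rng = random.Random(1)
x = [rng.randrange(2) for _ in range(2000)]
y = [rng.randrange(2) for _ in range(2000)]
phrases = joint_lz_parse(x, y)
c_lz = conditional_lz_complexity(phrases)
for S in (1, 2, 4):
    print(S, random_fsm_log2_prob(phrases, S), len(phrases) * math.log2(S) - c_lz)
```

For every machine drawn this way the first printed number should not exceed the second, in line with (336) applied to the parsed prefix of the sequences.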

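Next we connect $C_{LZ}(\mathbf{x}|\mathbf{y})$ with $L(\mathbf{x}|\mathbf{y})$ obtained by the conditional LZ algorithm. For each phrase $l$ this algorithm encodes $\mathbf{x}_l$ by sending the last letter plus the index of the phrase composed of the other letters, out of the $c_l(\mathbf{x}|\mathbf{y})$ phrases with the same $\mathbf{y}$. This requires at most $\log|\mathcal{X}| + \log c_l(\mathbf{x}|\mathbf{y}) + r_n$, where $r_n$ accounts for the additional overhead due to rounding and for the need to encode the length of $c_l(\mathbf{x}|\mathbf{y})$ (since $c_l(\mathbf{x}|\mathbf{y}) \le n$, the number of bits $\log c_l(\mathbf{x}|\mathbf{y})$ is at most $\log n$, so encoding this length takes at most $\log\log n$). Therefore

$$L(\mathbf{x}^n|\mathbf{y}^n) \le \sum_{l=1}^{c}\left[\log|\mathcal{X}| + \log c_l(\mathbf{x}|\mathbf{y}) + r_n\right] = C_{LZ}(\mathbf{x}|\mathbf{y}) + c\cdot(\log|\mathcal{X}| + r_n) \qquad (338)$$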



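Therefore

$$\frac{1}{n}L(\mathbf{x}^n|\mathbf{y}^n) \le \frac{1}{n}C_{LZ}(\mathbf{x}|\mathbf{y}) + \frac{c}{n}\cdot(\log|\mathcal{X}| + r_n) \le -\frac{1}{n}\log P(\mathbf{x}^n|\mathbf{y}^n) + \underbrace{\frac{c}{n}\cdot(\log|\mathcal{X}| + r_n + \log S)}_{\delta_n} \qquad (339)$$

where the factor $\delta_n$ in the RHS vanishes with $n$. Plugging this into the rate function (199) we obtain

$$R_{emp} = \log|\mathcal{X}| - \frac{1}{n}L(\mathbf{x}^n|\mathbf{y}^n) \ge \frac{1}{n}\log\frac{P(\mathbf{x}^n|\mathbf{y}^n)}{|\mathcal{X}|^{-n}} - \delta_n \qquad (340)$$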

I.e. this rate function surpasses, up to $\delta_n$, all rate functions defined by any $P(\mathbf{x}^n|\mathbf{y}^n)$ that can be implemented by a finite state machine.
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